Ruzica Piskac / Michael W. Whalen (Eds.)

# **PROCEEDINGS OF THE 21ST CONFERENCE ON FORMAL METHODS IN COMPUTER-AIDED DESIGN – FMCAD 2021**


# **Conference Series: Formal Methods in Computer-Aided Design Volume 2**


Series edited by:

Warren A. Hunt, Jr., The University of Texas at Austin, Austin, TX 78705 | hunt@cs.utexas.edu

Georg Weissenbacher, TU Wien, Karlsplatz 13, 1040 Wien, Austria | georg.weissenbacher@tuwien.ac.at

The Conference on Formal Methods in Computer-Aided Design (FMCAD) is an annual conference on the theory and applications of formal methods in hardware and system verification. FMCAD provides a leading forum to researchers in academia and industry for presenting and discussing groundbreaking methods, technologies, theoretical results, and tools for reasoning formally about computing systems. FMCAD covers formal aspects of computer-aided system design including verification, specification, synthesis, and testing.

Information on this publication series and the volumes published therein is available at www.tuwien.ac.at/academicpress.

Volume 2 edited by:

Ruzica Piskac, Yale University, 51 Prospect Street, New Haven, CT 06511, USA | ruzica.piskac@yale.edu

Michael W. Whalen, Amazon Web Services, Inc., 323 N Washington Ave, Minneapolis, MN 55401, USA | mww@amazon.com


#### **TU Wien Academic Press, 2021**

c/o TU Wien Bibliothek TU Wien Resselgasse 4, 1040 Wien academicpress@tuwien.ac.at www.tuwien.at/academicpress

This work is licensed under a Creative Commons attribution 4.0 international license (CC BY 4.0). https://creativecommons.org/licenses/by/4.0/

ISSN (online): 2708-7824

ISBN (online): 978-3-85448-046-4

Available online: https://doi.org/10.34727/2021/isbn.978-3-85448-046-4

Media proprietor: TU Wien, Karlsplatz 13, 1040 Wien

Publisher: TU Wien Academic Press

Publication series editors: Warren A. Hunt, Jr. and Georg Weissenbacher

Editors (responsible for the content): Ruzica Piskac and Michael W. Whalen

# Preface

These are the proceedings of the twenty-first International Conference on Formal Methods in Computer-Aided Design (FMCAD), which was held online from October 18 to October 22, 2021, due to the coronavirus pandemic. FMCAD was constituted in 1996 as a conference covering formal aspects of specification, verification, synthesis, testing, and security, and as a leading forum for researchers and practitioners in academia and industry alike. 2021 marks the 25th anniversary of that original meeting, and so we wish to celebrate the vision of those original organizers!

The program of FMCAD 2021 comprises four tutorials, three invited talks, a student forum, an industry night, a panel session on "25 years of FMCAD", and the main program consisting of presentations of 30 accepted papers. The tutorial day featured four presentations:


and the main conference featured three invited talks:


FMCAD'21 also hosted the ninth edition of the Student Forum, which has been held annually since 2013 and provides a platform for graduate students at any career stage to introduce their research to the FMCAD community. The FMCAD Student Forum 2021 was organized by Mark Santolucito and featured short presentations of 11 accepted contributions. A detailed description of the Student Forum, listing all accepted contributions, is provided in the conference proceedings.

FMCAD 2021 received 72 submissions, out of which the committee decided to accept 30 for publication. Each submission received at least three reviews. The topics of the accepted papers include hardware and software verification, SAT, SMT, learning, synthesis, neural-network verification, and more. Out of the accepted papers, 23 are classified as regular papers (20 long and 3 short) and 7 are classified as tool/case study papers (5 long and 2 short).

Organizing this event would not have been possible without the support of a large number of people and our sponsors. The program committee members and additional reviewers, listed on the following pages, did an excellent job providing detailed and insightful reviews, which helped the authors to improve their submissions and guided the selection of the papers accepted for publication. We thank each and every one of them for dedicating their time and providing their expertise. We thank William Hallahan (Yale University) for being the web master, Daniel Schoepe for being the Sponsorship Chair, and Mark Santolucito for organizing this year's FMCAD Student Forum. We thank Georg Weissenbacher (TU Wien) for his exceptional assistance in organizing the event, for communicating to us the decisions of the steering committee, and for serving as the publication chair. Holding a conference like FMCAD would not be feasible without the financial support of our sponsors. We would like to express our gratitude to our sponsors (in alphabetical order): Amazon Web Services, Amazon Prime Video, Cadence, Centaur Technology, Galois, Intel, Mentor Graphics, Novi, and Synopsys.

The conference proceedings are available as Open Access Proceedings published by TU Wien Academic Press, and through the IEEE Xplore Digital Library. Last but not least, we thank all authors who submitted their papers to FMCAD 2021 (accepted or not), and whose contributions and presentations form the core of the conference. We are grateful to everyone who presented their paper, gave a keynote or gave a tutorial. We thank all attendees of FMCAD for supporting the conference and making FMCAD a stimulating and enjoyable event.


# Organizing Committee

# Program Co-Chairs


Georg Weissenbacher TU Wien

# Program Committee

| Name | Affiliation |
| --- | --- |
| Erika Abraham | RWTH Aachen University |
| Pranav Ashar | Real Intent |
| Per Bjesse | Synopsys |
| Ivana Cerna | Masaryk University |
| Supratik Chakraborty | IIT Bombay |
| Sylvain Conchon | Université Paris-Sud |
| Leonardo de Moura | Microsoft |
| Grigory Fedyukovich | Florida State University |
| Arie Gurfinkel | University of Waterloo |
| Liana Hadarean | Amazon Web Services |
| Ziyad Hanna | Cadence Design System |
| Fei He | Tsinghua University |
| Alexander Ivrii | IBM |
| Dejan Jovanović | Amazon Web Services |
| Alan Jovic | University of Zagreb |
| Laura Kovacs | TU Wien |
| Rebekah Leslie-Hurd | Intel |
| Ruzica Piskac | Yale University |
| Andrew Reynolds | University of Iowa |
| Christoph Scholl | University of Freiburg |
| Anna Slobodova | Centaur Technology |
| Christoph Sticksel | The MathWorks |
| Jean-Baptiste Tristan | Boston College |
| Yakir Vizel | The Technion |
| Thomas Wahl | Northeastern University |
| Georg Weissenbacher | TU Wien |
| Thomas Wies | New York University |
| Valentin Wüstholz | ConsenSys |
| Jade Alglave | University College London |
| Roderick Bloem | Graz University of Technology |
| Rayna Dimitrova | CISPA Helmholtz Center for Information Security |
| Marijn Heule | Carnegie Mellon University |
| Warren A. Hunt, Jr. | The University of Texas at Austin |
| Ton Chanh Le | Stevens Institute of Technology |
| Kuldeep S. Meel | National University of Singapore |
| Elizabeth Polgreen | University of California, Berkeley |
| Natasha Sharygina | Università della Svizzera italiana (USI Lugano, Switzerland) |
| Murali Talupur | Amazon Web Services, Inc. |
| Michael Whalen | Amazon Inc. and the University of Minnesota |
| Lenore Zuck | University of Illinois in Chicago |

# Additional Reviewers

Asadi, Sepideh Athanasiou, Konstantinos

Bansal, Suguman Barnett, Lee Bendík, Jaroslav Blicha, Martin Bustan, Doron

Cano, Filip Chalupa, Marek Cheang, Kevin Chen, Hao Chernigovskaia, Lidiia

Ebrahimi, Masoud

Fan, Hongyu Fernandez, Matt Fraer, Ranan

Georgiou, Pamina Goel, Shilpi Golia, Priyanka Grundy, Jim

Hamza, Ameer Hjort, Håkan Hoereth, Stefan Hozzová, Petra Huang, Daniel Hyvärinen, Antti

Jacoby, Reily Jain, Himanshu Jain, Mitesh Jin, Hoon Sang Jonas, Martin

Könighofer, Bettina Kwan, Carl

Larrauri, Alberto Le, Nham

Maderbacher, Benedikt Majumdar, Rupak Moosbrugger, Marcel Mora, Federico

Nalbach, Jasper

Otoni, Rodrigo

Ramanathan, Vivek Rane, Ashay Reeves, Joseph Rehak, Vojtech Ročkai, Petr

Santolucito, Mark Schoisswohl, Johannes Seufert, Tobias Shi, Yunong Soos, Mate Stankovic, Miroslav Strejček, Jan Strichman, Ofer Sumners, Rob Swords, Sol

Tassarotti, Joseph Temel, Mertcan

Vediramana Krishnan, Hari Govind

Wolfovitz, Guy

# Table of Contents

# Tutorials


# Model Checking and IC3


# Applied Verification and Synthesis


# SAT Solving


# SMT and First-Order Logic


# Reactive Synthesis Beyond Realizability

Rayna Dimitrova *CISPA Helmholtz Center for Information Security* Saarbrücken, Germany dimitrova@cispa.de

*Abstract*—The automatic synthesis of reactive systems from high-level specifications is a highly attractive and increasingly viable alternative to manual system design, with applications in a number of domains such as robotic motion planning, control of autonomous systems, and development of communication protocols. The idea of asking the system designer to describe what the system should do instead of how exactly it does it holds great promise. However, providing the right formal specification of the desired behaviour of a system is a challenging task in itself. In practice it often happens that the system designer provides a specification that is unrealizable, that is, there is no implementation that satisfies it. Such situations typically arise because the desired behavior represents a trade-off between multiple conflicting requirements, or because crucial assumptions about the environment in which the system will execute are missing. Addressing such scenarios necessitates a shift towards synthesis algorithms that utilize quantitative measures of system correctness. In this tutorial I will discuss two recent advances in this research direction.

First, I will talk about the maximum realizability problem, where the input to the synthesis algorithm consists of a hard specification which must be satisfied by the synthesized system, and soft specifications which describe other desired, possibly prioritized properties, whose violation is acceptable. I will present a synthesis algorithm that maximizes a quantitative value associated with the soft specifications, while guaranteeing the satisfaction of the hard specification. In the second half of the tutorial I will present algorithms for synthesis in bounded environments, where a bound is associated with the sequences of input values produced by the environment. More concretely, these sequences consist of an initial prefix followed by a finite sequence repeated infinitely often, and satisfy the constraint that the sum of the lengths of the initial prefix and the loop does not exceed a given bound. I will also discuss the synthesis of approximate implementations from unrealizable specifications, which are guaranteed to satisfy the specification on at least a specified portion of the bounded-size input sequences. I will conclude by outlining some of the open avenues and challenges in quantitative synthesis from temporal logic specifications.
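To make the shape of these bounded input sequences concrete, here is a minimal sketch in plain Scala. The names `Lasso`, `prefix`, `loop`, and `withinBound` are illustrative assumptions and not notation from the cited work; requiring the loop to be non-empty is likewise our own assumption.

```scala
object LassoSketch {
  // Hypothetical representation of an ultimately periodic input sequence:
  // the letters of `prefix` once, then the letters of `loop` repeated forever.
  final case class Lasso[A](prefix: Vector[A], loop: Vector[A]) {
    require(loop.nonEmpty, "assumption: the repeated part is non-empty")

    // Sum of the lengths of the initial prefix and the loop.
    def size: Int = prefix.length + loop.length

    // The i-th input value (0-based) of the infinite sequence.
    def apply(i: Int): A =
      if (i < prefix.length) prefix(i)
      else loop((i - prefix.length) % loop.length)

    // True when prefix length plus loop length does not exceed the bound.
    def withinBound(bound: Int): Boolean = size <= bound
  }

  def main(args: Array[String]): Unit = {
    val w = Lasso(Vector(true), Vector(false, true))
    // First five input values of the infinite sequence.
    println((0 until 5).map(i => w(i)).mkString(", "))
    println(w.withinBound(3))
  }
}
```

A bounded environment with bound 3 would admit the sequence `w` above, while a bound of 2 would exclude it.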

This tutorial is based on joint work with Mahsa Ghasemi and Ufuk Topcu published in [1], [2], and with Bernd Finkbeiner and Hazem Torfah published in [3].

### REFERENCES


# Stainless Verification System Tutorial

Viktor Kunčak *LARA Research Group School of Computer and Communication Sciences EPFL* Lausanne, Switzerland

viktor.kuncak@epfl.ch

Jad Hamza *LARA Research Group School of Computer and Communication Sciences EPFL* Lausanne, Switzerland jad.hamza@epfl.ch

*Abstract*—Stainless ( https://stainless.epfl.ch ) is an open-source tool for verifying and finding errors in programs written in the Scala programming language. This tutorial will not assume any knowledge of Scala. It aims to get first-time users started with verification tasks by introducing the language, providing modelling and verification tips, and giving a glimpse of the tool's inner workings (encoding into functional programs, function unfolding, and using theories of satisfiability modulo theory solvers Z3 and CVC4).

Stainless (and its predecessor, Leon) has been developed primarily in EPFL's Laboratory for Automated Reasoning and Analysis from 2011 to 2021. Its core specification and implementation language consists of typed recursive higher-order functional programs (imperative programs are also supported by automated translation to their functional semantics). Stainless can verify that functions are correct for all inputs with respect to provided preconditions and postconditions, it can prove that functions terminate (with optionally provided termination measure functions), and it can provide counter-examples to safety properties. Stainless enables users to write code that is both executed and verified using the same source files. Users can compile programs using the Scala compiler and run them on the JVM. For programs that adhere to a certain discipline, users can generate source code in a small fragment of C and then use standard C compilers.

*Index Terms*—verification, formal methods, proof, counterexample, model checking, Scala, functional programming, satisfiability modulo theories

# I. INTRODUCTION

Stainless [1] is a tool for verifying and finding errors in programs written in a subset of the Scala [2] programming language. Stainless is open source (distributed under Apache license) and hosted on GitHub at:

https://github.com/epfl-lara/stainless/

https://epfl-lara.github.io/stainless/

Stainless (and its predecessor, Leon) has been developed primarily in EPFL's Laboratory for Automated Reasoning and Analysis from 2011 to 2021; see, in particular, [1], [3] as well as [4]–[14]. The core specification and implementation language of Stainless consists of typed recursive higher-order functional Scala programs. It also supports certain imperative programs [4], [6]. Stainless can verify that functions are correct for all inputs with respect to provided preconditions and postconditions, it can prove that functions terminate (with optionally provided termination measure functions), and it can also provide counter-examples to safety properties.

Stainless can be used to write programs that are directly executable and proven correct. In particular, because it uses Scala's syntax and type system, users can execute Stainless programs using the standard Scala compiler (version 2.12.13 at the time of writing). In addition, there are passes that eliminate non-executable (ghost) code from the source to make sure that it does not result in run-time overhead after compilation. For programs that adhere to a certain discipline, the "genc" option of Stainless can be used to generate C source code that compiles with common compilers such as gcc.

### *A. Outline*

In this tutorial, we show examples demonstrating how to use Stainless to develop verified models and programs. We will mostly use basic notation for functional programming, which we will introduce along the way. We will use Stainless version 0.9 or later.

In addition to basic introduction, we will suggest strategies for specifying programs and helping Stainless prove them correct. An example is using lemmas and proving them by induction expressed through terminating recursion.

To help users be more effective when using Stainless, we also outline key mechanisms that Stainless uses in proof and counterexample search: encoding into functional programs, function unfolding, and using rich theories of satisfiability modulo theory solvers Z3 and CVC4.

# II. GETTING STARTED

Stainless is a command line application that runs on the Java virtual machine, version 1.8. We mostly test it on Ubuntu Linux. We provide releases for Linux and Mac. Others use it on Windows as well, where it may be simplest to use Windows Subsystem for Linux to get started. Download the release file from

https://github.com/epfl-lara/stainless/releases/

then unzip the file and put a link to stainless in your path.

The following is a simple program, call it MaxBug.scala, containing a function max. The function attempts to compute the maximum of two 32-bit integers by returning one of them, depending on the sign of their difference d.

```
object TestMax {
 def max(x: Int, y: Int): Int = {
   val d = x - y
   if (d > 0) x
   else y
 } ensuring(res =>
   x <= res && y <= res && (res == x || res == y))
}
```
We use **object** to group functions into modules. We define functions using **def** and provide their parameters (here: x and y) and their types, as well as the return type. We define local immutable values using the **val** keyword. Scala infers the type of d as Int.

After the usual body, we introduced an **ensuring** clause. The first identifier, res, binds the return value of the function. After the arrow => we state the property we would like the result to satisfy. In this case, the result should be greater than or equal to each argument and it should be equal to one of them.

Invoke stainless MaxBug.scala and you may get output containing some of the following.

```
MaxBug.scala:7:49: warning: => INVALID
 x <= res && y <= res && (res == x || res == y))
                                             ^
warning: Found counter-example:
warning: y: Int -> -2147483648
         x: Int -> 1
Verified: 0 / 3
stainless summary
MaxBug.scala:3:13: max Subtraction overflow invalid
MaxBug.scala:7:37: max postcondition invalid
MaxBug.scala:7:49: max postcondition invalid
...................................................
total: 3 valid: 0 (0 from cache) invalid: 3
```
Use --timeout=5 to set the timeout to 5 seconds, and --no-colors to request clean ASCII output with parsable line numbers in reports.

Why did Stainless report a counterexample? Executing max with the two provided values computes, using signed 32-bit arithmetic, the value -2147483647 for d (the subtraction 1 - (-2147483648) overflows). Since d is not positive, the function returns y as the result res, so x <= res is false. We can repair this example in at least two ways:
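The counterexample can be replayed in plain Scala, outside Stainless. The sketch below also shows two candidate repairs. The names `maxCompare` and `maxWide` are chosen here for illustration, and these are not necessarily the repairs the authors have in mind.

```scala
object MaxRepair {
  // Original version: x - y may overflow in signed 32-bit arithmetic.
  def maxBuggy(x: Int, y: Int): Int = {
    val d = x - y
    if (d > 0) x else y
  }

  // Candidate repair 1 (assumption): compare directly, no subtraction.
  def maxCompare(x: Int, y: Int): Int = if (x > y) x else y

  // Candidate repair 2 (assumption): compute the difference in a wider type.
  def maxWide(x: Int, y: Int): Int = {
    val d = x.toLong - y.toLong // cannot overflow for 32-bit inputs
    if (d > 0) x else y
  }

  def main(args: Array[String]): Unit = {
    val x = 1
    val y = Int.MinValue // -2147483648, as in the reported counterexample
    println(x - y)            // wrapped difference
    println(maxBuggy(x, y))   // returns y, violating x <= res
    println(maxCompare(x, y)) // returns x
    println(maxWide(x, y))    // returns x
  }
}
```

Either repaired version can be given the same **ensuring** clause as the original and checked again with Stainless.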


If you run your program several times, you may notice that Stainless reports that a valid verification condition was persistently cached (inside .stainless-cache). You can turn off caching with --vc-cache=false.

You may find the --watch option useful when modifying a file several times, which makes Stainless run verification whenever the source file is changed.

By default, Stainless uses a version of z3 (4.7.1) which is packaged inside Stainless (--solvers=nativez3). This allows Stainless to interact with z3 through Java calls. You may also use an externally built version of z3 (for instance, z3 4.8.12 is shipped with the release) by specifying --solvers=smt-z3. In that case, Stainless will communicate with z3 using SMT-LIB files, which might be slower than Java calls, but has two benefits. First, you get to use the newest release of z3. Second, smt-z3 is more likely to respect timeouts than nativez3.

You can also use CVC4 as the solver if you download the cvc4 executable and put it on your path. You can use both solvers with --solvers=smt-cvc4,smt-z3. Use --debug=smt to preserve the generated SMT-LIB files and look for them in the smt-sessions directory.

#### III. VERIFIED FUNCTIONAL PROGRAMMING

We will now implement a simple function that computes differences of successive elements of a list. Let us start our file with **import** stainless.collection.**\_** so we can use the immutable List library of Stainless. You can find the sources of this and other library files at the following URL:

https://github.com/epfl-lara/stainless/blob/master/frontends/library/stainless/collection/List.scala

Let's try to write a function diffs that takes a list of elements, for example $x_1, x_2, x_3, x_4$, keeps the first element, and then follows it by the list of successive differences. In this case we would like to obtain $x_1,\; x_2 - x_1,\; x_3 - x_2,\; x_4 - x_3$. For the empty list and for one-element lists the output equals the input. Let us write this as the default implementation. We can also state the example of a four-element list as a symbolic test case. To state it, we use another function with a dummy body and a postcondition that invokes diffs.

```
import stainless.collection._
object Diffs {
 def diffs(l: List[BigInt]): List[BigInt] = {
   l match {
    case Nil() => l
    case _ :: Nil() => l
    // missing cases
   }
 }
 def test(x1: BigInt, x2: BigInt,
        x3: BigInt, x4: BigInt): Unit = {
 } ensuring(_ =>
   diffs(List(x1,x2,x3,x4)) ==
    List(x1, x2 - x1, x3 - x2, x4 - x3))
}
```
After developing a function that meets this partial specification, we can see whether it meets a stronger specification. For example, we can define the inverse function undiff that takes $y_0, y_1, \ldots, y_n$ and computes $y_0,\; y_0 + y_1,\; \ldots,\; \sum_{i=0}^{n} y_i$. Being masters of functional programming, we recognize that this is just a prefix sum of a list, so we define it by

```
def undiff(l: List[BigInt]): List[BigInt] =
 l.scanLeft(BigInt(0))(_ + _).tail
```
where scanLeft is defined in our List library. Now we can add to diffs the postcondition **ensuring** (res => (undiff(res) == l)). It so happens that Stainless proves this condition automatically using its algorithm. As an off-line exercise, try to prove this result with pen and paper. This might give you a sense of how Stainless is able to prove this property.
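For readers who want to experiment before turning to the verifier, here is one possible completion of diffs, sketched in plain Scala with the standard library `List` rather than the Stainless one (an assumption made so that the snippet runs without Stainless). It checks the round-trip property on a concrete input only and is not a proof.

```scala
object DiffsSketch {
  // One possible completion of the missing case (not the authors' solution).
  def diffs(l: List[BigInt]): List[BigInt] = l match {
    case Nil      => l
    case _ :: Nil => l
    // Keep x1, emit x2 - x1, then reuse the differences of the tail.
    case x1 :: (t @ (x2 :: _)) => x1 :: (x2 - x1) :: diffs(t).tail
  }

  // Prefix sums, as in the text: the inverse of diffs.
  def undiff(l: List[BigInt]): List[BigInt] =
    l.scanLeft(BigInt(0))(_ + _).tail

  def main(args: Array[String]): Unit = {
    val l = List[BigInt](3, 5, 4, 10)
    println(diffs(l))         // differences of successive elements
    println(undiff(diffs(l))) // should reproduce l
    assert(undiff(diffs(l)) == l)
  }
}
```

In Stainless the same property would be stated once, for all lists, as the **ensuring** clause of diffs.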

The algorithm of Stainless initially treats called functions as unknown (uninterpreted) mathematical functions. It then iteratively expands each call by defining the function to be equal to one unfolding of its body and also inserts the **ensuring** clause as an assumption.

#### IV. AMORTIZED QUEUE

We have found Stainless to work very well for verification of purely functional data structures. Let us examine the case of an amortized queue such as the one from [15, Section 5.2, Page 42]. We will start by writing down an *abstract class*. In this class we define methods with dummy bodies denoted by ??? but with **ensuring** clauses that specify the desired behavior of operations. To specify the behavior we use the toList function, which is also left unspecified in the abstract class.

```
import stainless.collection._
import stainless.lang._
abstract class Queue[A] {
 def enqueue(a: A) = (??? : Queue[A])
   .ensuring(res =>
   res.toList == this.toList ++ List(a))
 def dequeue: Option[(A, Queue[A])] =
   (??? : Option[(A, Queue[A])])
 .ensuring(res => res match {
   case None() =>
    this.toList == Nil[A]()
   case Some((a, q)) =>
    this.toList == a :: q.toList
   })
 def toList: List[A]
}
```
When we extend the abstract class, Scala requires us to define toList, whereas Stainless ensures that our implementation meets the specifications in the abstract class. We can implement an inefficient queue using a single list.

```
case class SimpleQueue[A](l: List[A])
   extends Queue[A] {
 def enqueue(a: A) = SimpleQueue(l ++ List(a))
 def dequeue = l match {
   case Nil() => None()
   case Cons(x, xs) => Some((x, SimpleQueue(xs)))
 }
 def toList = l
}
```
Stainless successfully verifies that the properties required by a queue are satisfied by this implementation. Even if correct, this implementation is inefficient because enqueue takes linear time in the current number of queue elements. We will thus try to develop and prove correct an implementation like the one from [15, Section 5.2, Page 42] that uses two lists and has constant amortized time complexity.

```
case class AmortizedQueue[A](front: List[A],
                      rear: List[A])
   extends Queue[A] {
 def toList = front ++ rear.reverse
```

The toList function, which we use only for specification, gives us a hint on how to implement enqueue efficiently. For dequeue we will need a reverse operation on lists, which we can implement in linear time. Despite its complexity, our version of dequeue will be verified automatically. As for enqueue, its implementation is simple, yet its proof turns out to require a well-known property of lists that we need to tell Stainless to invoke explicitly!

```
def enqueue(a: A): Queue[A] = {
 val res: Queue[A] = // to fill
 // You can state using assertions things you know are true,
 // to see if Stainless is able to prove them:
 assert(res.toList == front ++ (a :: rear).reverse)
 // Alternatively, you can use an equation style reasoning.
 // Here Stainless should timeout from the second to the third
 // step, because some steps are missing.
 (
   res.toList ==:| trivial |:
   front ++ (a :: rear).reverse ==:| trivial |:
   // Add missing steps here to arrive to the result.
   // For complicated steps, you need to invoke lemmas
   // instead of writing 'trivial'.
   this.toList ++ List(a)
 ).qed
 res
}
```
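To experiment with the data structure itself, here is a minimal two-list queue sketched in plain Scala with the standard library `List` (an assumption, so that it runs without Stainless). It is one possible way to fill in the operations, not necessarily the authors' solution, and it checks the specification only on sample values at run time.

```scala
object AmortizedQueueSketch {
  final case class AmortizedQueue[A](front: List[A], rear: List[A]) {
    // Specification view of the queue: front elements first, rear reversed.
    def toList: List[A] = front ++ rear.reverse

    // Constant time: push onto the rear list.
    def enqueue(a: A): AmortizedQueue[A] = AmortizedQueue(front, a :: rear)

    // Take from front; when front is empty, reverse rear once (linear time).
    def dequeue: Option[(A, AmortizedQueue[A])] = front match {
      case x :: xs => Some((x, AmortizedQueue(xs, rear)))
      case Nil =>
        rear.reverse match {
          case x :: xs => Some((x, AmortizedQueue(xs, Nil)))
          case Nil     => None
        }
    }
  }

  def main(args: Array[String]): Unit = {
    val q = AmortizedQueue(List(1, 2), List(4, 3))
    // enqueue specification, checked on one instance
    assert(q.enqueue(5).toList == q.toList ++ List(5))
    // dequeue specification, checked on one instance
    q.dequeue.foreach { case (a, rest) => assert(q.toList == a :: rest.toList) }
    println(q.enqueue(5).toList)
  }
}
```

The run-time check of enqueue passes because `(a :: rear).reverse` equals `rear.reverse ++ List(a)` and `++` is associative; facts of this kind are what a Stainless proof of enqueue has to make explicit.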
#### V. PROPERTIES AND PROOFS

How do we state properties in Stainless? We write a property ∀x : T.F(x) as a function lemmaF defined by:

**def** lemmaF(x: T): Unit = { () } **ensuring** (**\_** => F(x))

When we wish to instantiate the property taking x to be some specific value v, we insert a function invocation lemmaF(v) into the part of the code where we need this property. Suppose that proving property ∀x : T.F(x) is not automatic. Then verification of lemmaF itself will fail, as stated. If F(x), for example, follows from G(x, x + 1) that is established in lemmaG(x,y), then we can state and prove lemmaF as:

**def** lemmaF(x: T): Unit = { lemmaG(x,x+1) } **ensuring** (**\_** => F(x))

Thus, we can adopt the following strategies for libraries of lemmas:


Purely universal statements can return Unit type. For existential statements, we can often state their constructive Skolemized form and return a witness for the existential quantifier from the lemma.

It can be helpful to examine some proofs of properties in the List library. Remarkably, we can even make recursive invocations of functions in their bodies. Which mathematical reasoning principle do such proofs correspond to?
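As an illustration of both lemma styles, here is a sketch in plain Scala, where `require` and `ensuring` from the standard `Predef` act as run-time checks rather than proof obligations. The lemma names `appendAssoc` and `maxWitness` are our own and the standard library `List` replaces the Stainless one; the recursive call in `appendAssoc` plays the role of an induction hypothesis.

```scala
object LemmaSketch {
  // Universal lemma: for all xs, ys, zs,
  //   (xs ++ ys) ++ zs == xs ++ (ys ++ zs).
  // The recursion on xs mirrors a proof by structural induction.
  def appendAssoc[A](xs: List[A], ys: List[A], zs: List[A]): Unit = {
    xs match {
      case Nil       => ()                        // base case
      case _ :: tail => appendAssoc(tail, ys, zs) // induction hypothesis
    }
  } ensuring (_ => (xs ++ ys) ++ zs == xs ++ (ys ++ zs))

  // Existential statement in Skolemized form: every non-empty list has an
  // element that bounds all others. The lemma returns the witness.
  def maxWitness(l: List[BigInt]): BigInt = {
    require(l.nonEmpty)
    l.tail.foldLeft(l.head)((a, b) => if (a >= b) a else b)
  } ensuring (m => l.contains(m) && l.forall(_ <= m))

  def main(args: Array[String]): Unit = {
    appendAssoc(List(1, 2), List(3), List(4, 5)) // instantiate the lemma
    println(maxWitness(List[BigInt](3, 7, 5)))   // print the witness
  }
}
```

In Stainless, the same function shapes would be checked once for all inputs, and calling a lemma at a given point makes its postcondition available to the solver there.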

#### VI. DIGITS

For built-in types such as Int and Long, the SMT solvers will successfully reason about their bitwidth representation. What if we wish to reason about the bits of arbitrarily large numbers? As a simple example, let us define simple addition as a recursive function on lists of bits.

```
import stainless.annotation._
import stainless.lang._
import stainless.collection._
object AddBitwise {
 type Digits = List[Boolean]
 val zero = Nil[Boolean]()
 def add(x: Digits, y: Digits, carry: Boolean):
     Digits = {
   require(x.length == y.length)
   (x,y) match {
    case (Nil(), Nil()) =>
      if (carry) true::zero else zero
    case (Cons(x1,xs), Cons(y1,ys)) => {
      val z = x1 ^ y1 ^ carry
      val carry1 = (x1 && y1) ||
                (x1 && carry) ||
                (y1 && carry)
      z :: add(xs, ys, carry1)
    }
   }
 }
}
```
How can we state that such addition is commutative? How can we prove it in Stainless? As an off-line exercise, think about how we can prove that this corresponds to actual addition on integers (BigInt).
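Before attempting the proofs, it can help to test the two conjectures on small inputs. The sketch below restates add in plain Scala with the standard library `List` (an assumption, to run without Stainless) and assumes the least significant bit comes first, which is how the carry propagates in the definition above. The helpers `value` and `allDigits` are hypothetical names introduced here.

```scala
object AddBitwiseCheck {
  type Digits = List[Boolean]

  // Same algorithm as in the text, on standard library lists.
  def add(x: Digits, y: Digits, carry: Boolean): Digits = {
    require(x.length == y.length)
    (x, y) match {
      case (Nil, Nil) => if (carry) List(true) else Nil
      case (x1 :: xs, y1 :: ys) =>
        val z = x1 ^ y1 ^ carry
        val carry1 = (x1 && y1) || (x1 && carry) || (y1 && carry)
        z :: add(xs, ys, carry1)
      case _ => sys.error("unreachable: lengths are equal")
    }
  }

  // Numeric value of a digit list, least significant bit first.
  def value(d: Digits): BigInt = d match {
    case Nil     => BigInt(0)
    case b :: bs => (if (b) BigInt(1) else BigInt(0)) + value(bs) * 2
  }

  // All digit lists of length n.
  def allDigits(n: Int): List[Digits] =
    if (n == 0) List(Nil)
    else for { b <- List(false, true); rest <- allDigits(n - 1) } yield b :: rest

  def main(args: Array[String]): Unit = {
    val inputs = allDigits(3)
    for (x <- inputs; y <- inputs; c <- List(false, true)) {
      val carryIn = if (c) BigInt(1) else BigInt(0)
      assert(value(add(x, y, c)) == value(x) + value(y) + carryIn) // matches BigInt addition
      assert(add(x, y, c) == add(y, x, c))                         // commutativity
    }
    println("checked all 3-bit inputs")
  }
}
```

Exhaustive testing covers only a fixed width; the Stainless lemmas would state the same two properties for digit lists of any length.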

#### VII. TERMINATION

The following recursive function searches for an element in a sorted array, but it has a bug. You may run Stainless on this file to spot it. Fix the issue, and add a **decreases** clause at the beginning of the function to ensure that Stainless can prove the function terminating.

```
import stainless.lang._
object BinarySearch1 {
 def search(arr: Array[Int], x: Int, lo: Int, hi:
     Int): Boolean = {
   if (lo <= hi) {
    val i = (lo + hi) / 2
    val y = arr(i)
    if (x == y) true
    else if (x < y) search(arr, x, lo, i-1)
    else search(arr, x, i+1, hi)
   } else {
    false
   }
 }
}
```
In Stainless, all functions are required to have a measure (either inferred automatically, or written in a **decreases** clause by the user). The system in its current design would be unsound (we would be able to prove false postconditions or assertions) if we allowed non-terminating functions.
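The following plain Scala sketch shows one plausible repair, under our reading that the problems are the missing bounds on lo and hi and the possible overflow of lo + hi. The precondition and the measure hi - lo + 1 mentioned in the comment are our assumptions about what a verified version might use.

```scala
object BinarySearchSketch {
  def search(arr: Array[Int], x: Int, lo: Int, hi: Int): Boolean = {
    // Assumed precondition: the interval [lo, hi] lies inside the array
    // (or is empty, with lo == hi + 1).
    require(0 <= lo && hi < arr.length && lo <= hi + 1)
    // In Stainless, a measure such as decreases(hi - lo + 1) would go here:
    // it is non-negative and strictly smaller in both recursive calls.
    if (lo <= hi) {
      val i = lo + (hi - lo) / 2 // avoids the overflow of (lo + hi) / 2
      val y = arr(i)
      if (x == y) true
      else if (x < y) search(arr, x, lo, i - 1)
      else search(arr, x, i + 1, hi)
    } else {
      false
    }
  }

  def main(args: Array[String]): Unit = {
    val arr = Array(1, 3, 5, 7, 9)
    println(search(arr, 7, 0, arr.length - 1))
    println(search(arr, 4, 0, arr.length - 1))
  }
}
```

The measure decreases because lo <= i <= hi: the first call shrinks the interval to i - lo elements and the second to hi - i elements, both fewer than hi - lo + 1.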

#### VIII. IMPERATIVE FEATURES

Stainless supports some imperative features, such as local mutable variables, while loops, return statements, and more (see https://epfl-lara.github.io/stainless/imperative.html). Stainless transforms these constructs into functional programs.

Using a while loop and a return statement, rewrite the findIndexOpt function:

```
def findIndexOpt(ar: Array[Int], v: Int):
                                      Option[Int] = {
}
```

that finds an index of element v in a sorted array ar. Prove that, when your function returns Some(i), then ar(i) == v. To prove that array indices are within bounds, you will need a loop invariant, for which the syntax is:

```
(while(...) {
 decreases(...)
 ...
}).invariant(...)
```
Does Stainless help you if you make an overflow mistake when computing the middle of an interval using bounded arithmetic?
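As a starting point for this exercise, here is one possible loop-based version in plain Scala (not the authors' solution). The Stainless-specific `decreases` and `invariant` annotations appear only as comments, and the concrete invariant and measure are our assumptions. The helpers `midNaive` and `midSafe` are hypothetical names that isolate the overflow question.

```scala
object FindIndexSketch {
  // Midpoint computed naively: lo + hi can wrap around for large indices.
  def midNaive(lo: Int, hi: Int): Int = (lo + hi) / 2

  // Midpoint that stays within [lo, hi] whenever 0 <= lo <= hi.
  def midSafe(lo: Int, hi: Int): Int = lo + (hi - lo) / 2

  def findIndexOpt(ar: Array[Int], v: Int): Option[Int] = {
    var lo = 0
    var hi = ar.length - 1
    while (lo <= hi) {
      // Assumed measure for Stainless: decreases(hi - lo + 1)
      // Assumed invariant: 0 <= lo && hi < ar.length
      val mid = midSafe(lo, hi)
      if (ar(mid) == v) return Some(mid)
      else if (ar(mid) < v) lo = mid + 1
      else hi = mid - 1
    }
    None
  }

  def main(args: Array[String]): Unit = {
    println(findIndexOpt(Array(1, 3, 5, 7, 9), 7))
    println(findIndexOpt(Array(1, 3, 5, 7, 9), 4))
    // The naive midpoint becomes negative for large bounds:
    println(midNaive(1 << 30, Int.MaxValue))
    println(midSafe(1 << 30, Int.MaxValue))
  }
}
```

Real arrays are rarely large enough to trigger the wrap-around at run time, which is why this kind of mistake tends to survive testing and is worth checking with a verifier.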

Note that while loops require **decreases** clauses as well (when the measure cannot be inferred automatically), because they are translated into recursive functions by Stainless. To see how the while loop and the return statement are transformed, you may run the command below on your file. Stainless has a pipeline containing several phases, and ReturnElimination is the one that removes while loops and return statements. The --debug-objects option tells Stainless to only display the findIndexOpt function in the debug output.

```
stainless --debug=trees \
    --debug-objects=findIndexOpt \
    --debug-phases=ReturnElimination FindIndex.scala
```
As a harder exercise, identify and prove a stronger postcondition of findIndexOpt: what can we state in the postcondition for the case when the function returns None? What assumptions and loop invariants do we need to be able to prove this postcondition?

#### IX. DESIGN PRINCIPLES

A number of verification systems have been developed in the past decades. Stainless tries to borrow many of the features that others and we have found useful in other systems. At the same time, it is driven by a somewhat unique combination of principles, whose understanding may help set the expectations from the tool.

#### *A. Searching for Both Proofs and Counterexamples*

From the beginning [13], the system was designed to search for both counterexamples and proofs in a unified iterative loop. Thanks to this design, on many programs Stainless behaves like a combination of a bounded model checker and a k-inductive prover such as [16]: we can often expect a definite answer, whether the program verifies or has a counterexample.

#### *B. Recursive programs as foundation, not transition systems.*

Operational semantics tells us that we can translate functional (and many other) programs into transition systems. This has even been used in verification tools with success []. Nonetheless, we believe that it carries significant overhead, especially for proofs. Thus, like in ACL2 [17], [18] our intermediate representation is based on recursive functions [13] and we hope to leverage high-level structure to make verification more feasible, much like Liquid Haskell [19] which needs to be complemented with symbolic execution to also generate counterexamples [20]. Consequently, iterative unfolding of our recursive functions in Stainless gives a different sequence of approximations than the one we would obtain by representing programs using control-flow graphs and explicit stacks [21].

#### *C. Top-down verification for each function.*

Stainless verifies each desired function one by one. When verifying a function f, it does not check which other parts of code invoke f. In particular, it will, in its current design, not infer preconditions for a function automatically. Preconditions need to be explicitly specified using a **require** clause at function entry. On the other hand, when Stainless examines the body of f and finds a function g, then it will examine not only the specification of g, but also its body. If g is recursive, this process will continue, with a check for counterexample and check for unsatisfiability performed at each step. This process treats functions more transparently than some modular verifiers. The process is also breadth-first, instead of having the form of directed rewriting as in some other systems. The effectiveness of this process is explained in part by the fact that it results in a decision procedure for certain classes of functions [14], [22], [23]. Furthermore, we continue to be surprised by how well this simple strategy works in practice, even if we have no theoretical reason to know that it will succeed.

#### *D. Scala subset as the input language.*

Stainless uses Scala, a language with a substantial user base that is regularly ranked higher than Haskell and LISP in Stack Overflow developer surveys [24]; this is relevant for maintaining the correspondence between what executes and what is verified. As a functional language, Scala contains an expressive purely functional fragment which can be used for specification and modelling. The users of Stainless thus largely avoid the need to learn a separate specification language, because functional programs are a great specification vehicle. At the same time, the system supports polymorphism and subtyping with a type system that eliminates many nonsensical programs before they waste the user's time inside the program verifier's loop. That said, Stainless avoids by design certain Scala 2 features, such as null references and complex initialization. Other features, such as machine integers, are modelled precisely: it is certainly necessary in practice to have machine integers of various widths available (for example, 32-bit Int and 64-bit Long), but it is also helpful to use unbounded BigInt data types, especially for specifications, and these different types should not be confused. Stainless provides the user a choice and maps these data types and operations on them to the appropriate types and theories inside SMT solvers [8]. Subtyping is currently implemented via a translation into a language with disjoint types [3]; its use requires additional encoding and may slow down verification. Imperative features are supported as a choice of either unshared mutable state [6] or using a model [4] that, at user level, is similar to dynamic frames [25] of Dafny [26].
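The difference between machine integers and unbounded integers can be illustrated with a small sketch. It is written in Python rather than Scala and is not Stainless code: Python's built-in `int` is unbounded and stands in for `BigInt`, while `to_int32` and `add_int32` are hypothetical helpers that emulate the wrap-around of a signed 32-bit `Int`.

```python
INT32_MAX = 2**31 - 1


def to_int32(value: int) -> int:
    """Wrap an unbounded integer to a signed 32-bit two's-complement value."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


def add_int32(a: int, b: int) -> int:
    """Addition as a 32-bit machine integer would perform it (hypothetical helper)."""
    return to_int32(a + b)


# Unbounded arithmetic: x + 1 > x holds for every x.
unbounded_sum = INT32_MAX + 1

# 32-bit arithmetic: the same addition wraps around to the most negative value.
wrapped_sum = add_int32(INT32_MAX, 1)
```

A property such as `x + 1 > x` is therefore valid for the unbounded type but has a counterexample at `INT32_MAX` for the 32-bit type, which is one reason the two kinds of integers should not be confused in specifications.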

#### *E. Embracing SMT solver theories, avoiding quantifiers.*

Instead of using axioms to encode program semantics and data types, Stainless leverages algebraic data types, sets, and arrays. Stainless thus currently emits quantifier-free queries to solvers (either Z3 or CVC4). The hope with this choice is that SMT solvers will remain predictable for both proofs and counterexamples. In contrast, the use of quantifiers may lead to more automation and sometimes excellent performance for proofs, but quickly leads outside of the space where the solvers can reliably report counterexamples.

#### *F. Executability of programs and specifications.*

In Stainless we aim to write programs that can be compiled using the standard Scala compiler. Specification constructs in Stainless are defined in a Scala library and they have dummy execution semantics. In some cases, even such dummy semantics may result in overhead, so we have developed passes that eliminate some of the specification code altogether. In addition, Stainless has a subset that can be used to generate C code suitable for embedded systems, an enhanced version of such functionality developed for Leon [27].

Acknowledgements. Research on Stainless has been funded in part by (i) the Swiss Science Foundation grants 200021 132176, 200020 138204, 200020 146649, 200021 144503, 200020 159949, and 200021 175676, (ii) European Research Council (ERC) Starting Grant PE6-306484-IMPRO, (iii) the Swiss State Secretariat for Education, Research and Innovation, Swiss Space Office grant "Embedded Flight Software Verification–ESOVER", and (iv) the envelope budget for the LARA group from the EPFL School of Computer and Communication Sciences.

Stainless and Inox were created from parts of Leon code by Nicolas Voirol. In addition to Nicolas and the two authors of this tutorial, contributors to Stainless and Inox include: Roman Ruetschi, Georg Stefan Schmid, Marco Antognini, Ravichandhran Madhavan, Etienne Kneuss, Lars Hupel, Emmanouil Koukoutos, Philippe Suter, Romain Edelmann, Utkarsh Upadhyay, Ivan Kuraj, Sandro Stucki, Ruzica Piskac, Tihomir Gvero, Czipo Bence, Sumith Kulal, Lucien Iseli, Régis Blanc, Iulian Dragos, Dragana Milovančević, Antoine Brunner, Mirco Dotta, Yann Bolliger, Rodrigo Raya, Samuel Gruetter, Mikaël Mayer, Guillaume Massé. Romain Jufer worked with Jad Hamza on a fork for smart contract verification and Solidity code generation; Romain Edelmann and Rodrigo Raya developed an interactive proof assistant concept based on Inox. Régis Blanc developed a Scala library for input and output of SMT-LIB files. The ScalaZ3 interface to the Z3 dynamically linked library additionally received contributions from Ali Sinan Köksal and Thorsten Tarrach. Contributors to Stainless Bolts case studies additionally include Samuel Chassot and Clément Burgelin. We thank users of Stainless from Ateleris GmbH including Simon Felix, Filip Schramka, and Ivo Nussbaumer. We also thank MSc students at EPFL taking the Formal Verification course, completing interesting case studies and identifying bugs in the system.

#### REFERENCES


# Formal Methods for the Security Analysis of Smart Contracts

Matteo Maffei

*TU Wien* Vienna, Austria matteo.maffei@tuwien.ac.at

*Abstract*—Smart contracts consist of distributed programs built over a blockchain and they are emerging as a disruptive paradigm to perform distributed computations in a secure and efficient way. Given their nature, however, program flaws may lead to dramatic financial losses and can be hard to fix. This motivates the need for formal methods that can provide smart contract developers with correctness and security guarantees, ideally automating the verification task.

This tutorial introduces the semantic foundations of smart contracts and reviews the state-of-the-art in the field, focusing in particular on the automated, sound, static analysis of Ethereum smart contracts. We will highlight the strengths and drawbacks of different methods, suggesting open challenges that can stimulate new research strands. Finally, we will overview eThor, an automated static analysis tool that we recently developed based on rigorous semantic foundations.

# Active Automata Learning: from L∗ to L#

Frits Vaandrager *Radboud University* Nijmegen, The Netherlands F.Vaandrager@cs.ru.nl

*Abstract*—In this tutorial on active automata learning algorithms, I will start with the famous L∗ algorithm proposed by Dana Angluin in 1987, and explain how this algorithm approximates the Nerode congruence by means of refinement. Next, I will present a brief overview of the various improvements of the L∗ algorithm that have been proposed over the years. Finally, I will introduce L#, a new and simple approach to active automata learning. Instead of focusing on equivalence of observations, like the L∗ algorithm and its descendants, L# takes a different perspective: it tries to establish *apartness*, a constructive form of inequality.

# From Viewstamped Replication to Blockchains

Barbara Liskov MIT Computer Science & Artificial Intelligence Lab Cambridge, MA, USA liskov@csail.mit.edu

*Abstract*—This talk will discuss two replication protocols. The first, Viewstamped Replication, was developed in the 1980s when research on replication protocols was concerned primarily with systems that survived crash failures, i.e., individual replicas could fail only by crashing. Viewstamped Replication is similar to Paxos; it was the earliest practical replication algorithm that provided the ability to execute general operations (as opposed to just reads and writes).

In the 1990s, researchers became interested in systems that could survive Byzantine failures, in which replicas fail arbitrarily. Replicated systems that survive Byzantine failures are substantially more complex, requiring both more replicas and more phases of communication, than those that survive only crash failures. The talk will present PBFT, the first practical replication technique that handles Byzantine failures. PBFT is now of great interest to researchers working on blockchains.

Formal Methods in Computer-Aided Design 2021

# Algorithms for the People

Seny Kamara *Brown University* Providence, Rhode Island, USA seny@brown.edu

*Abstract*—Algorithms have transformed every aspect of society, including communication, transportation, commerce, finance, and health. The revolution enabled by computing has been extraordinarily valuable. The largest tech companies generate a trillion dollars a year and employ 1 million people. But technology does not affect everyone in the same way. In this talk, we will examine how new technologies affect marginalized communities and think about what technology and academic research would look like if its goal was to serve the disenfranchised.

# Engineering with Full-scale Formal Architecture: Morello, CHERI, Armv8-A, and RISC-V

Peter Sewell University of Cambridge Cambridge, UK Peter.Sewell@cl.cam.ac.uk

*Abstract*—Architecture specifications define the fundamental interface between hardware and software. Historically, mainstream architecture specifications have been informal prose-and-pseudocode documents. This talk will describe our work to establish and use mechanised semantics for full-scale instruction-set architectures (ISAs): the mainstream Armv8-A architecture, the emerging RISC-V architecture, the CHERI-MIPS and CHERI-RISC-V research architectures that use hardware capabilities for improved security, and Arm's prototype Morello architecture – an industrial demonstrator incorporating the CHERI ideas.

We use a variety of tools, especially our Sail ISA definition language and Isla symbolic evaluation engine, to build semantic definitions that are readable, executable as test oracles, support reasoning within the Coq, HOL4, and Isabelle proof assistants, support SMT-based symbolic evaluation, support model-based test generation, and can be integrated with operational and axiomatic concurrency models. These models are all complete enough to boot operating systems and hypervisors, covering the full sequential ISA (though not other SoC components, such as the Arm Generic Interrupt Controller). They range from 5000 to 60000 lines of specification.

For CHERI-MIPS and CHERI-RISC-V, we have used Sail models (and previously L3 models) as the golden reference during design, working with our systems and computer architecture colleagues in the CHERI team to use lightweight formal specification routinely in documentation, testing, and test generation. We have stated and proved (in Isabelle) some of the fundamental intended security properties of the full CHERI-MIPS ISA.

For Armv8-A, building on Arm's internal shift to an executable model in their ASL language, we have the complete sequential ISA semantics automatically translated from the Arm ASL to Sail, and for RISC-V, we have hand-written what is now the officially adopted model. For their concurrent semantics, the "user" semantics, partly as a result of our collaborations with Arm and within the RISC-V concurrency task group, have become simplified and well-defined, with multiple models proved equivalent, and we are currently working on the "system" semantics. Our symbolic execution tool for Sail specifications, Isla, supports axiomatic concurrency models over the full ISA.

This work was partially supported by the UK Government Industrial Strategy Challenge Fund (ISCF) under the Digital Security by Design (DSbD) Programme, to deliver a DSbDtech enabled digital platform (grant 105694), ERC AdG 789108 ELVER, EPSRC programme grant EP/K008528/1 REMS, Arm iCASE awards, EPSRC IAA KTF funding, the Isaac Newton Trust, the UK Higher Education Innovation Fund (HEIF), Thales E-Security, Microsoft Research Cambridge, Arm Limited, Google, Google DeepMind, HP Enterprise, and the Gates Cambridge Trust. Approved for public release; distribution is unlimited. This work was supported by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL), under contracts FA8750-10-C-0237 ("CTSRD"), FA8750-11-C-0249 ("MRC2"), HR0011-18-C-0016 ("ECATS"), and FA8650-18-C-7809 ("CIFV"), as part of the DARPA CRASH, MRC, and SSITH research programs. The views, opinions, and/or findings contained in this report are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Morello, supported by the UKRI Digital Security by Design programme, offers a path to hardware enforcement of fine-grained memory safety and/or secure encapsulation in the production Armv8-A architecture, potentially excluding or mitigating a large fraction of today's security vulnerabilities for existing C/C++ code with little modification. During the ISA design process, we have proved (in Isabelle) fundamental security properties for the complete Morello ISA definition, and generated tests from the definition which were used during hardware development and for QEMU bring-up.

All these tools and models are (or will soon be) available under open-source licences, providing well-validated models for others to use and build on.

This is joint work by many people, including especially, *for Sail and Isla:* Alasdair Armstrong, Brian Campbell, Kathryn E. Gray, Mark Wassell, Jon French, Neel Krishnaswami; *for Morello verification and ASL-to-Sail translation:* Thomas Bauereiss, Thomas Sewell, Brian Campbell, Alasdair Armstrong, Alastair Reid; *for Morello and CHERI-MIPS test generation:* Brian Campbell; *for CHERI-MIPS verification:* Kyndylan Nienhuis; *for RISC-V and CHERI-RISC-V specifications:* Robert M. Norton, Prashanth Mundkur, Jessica Clark; *for MIPS and CHERI-MIPS specifications:* Alexandre Joannou, Anthony Fox, Michael Roe, Matthew Naylor; *and for Concurrency semantics:* Christopher Pulte, Shaked Flur, Will Deacon, Ben Simner, Luc Maranget, Susmit Sarkar, Jean Pichon-Pharabod, Ohad Kammar, Jeehoon Kang, Sung-Hwan Lee, Chung-Kil Hur. All this is in collaboration with the rest of the CHERI team and others in Arm (especially Richard Grisenthwaite, Graeme Barnes, and the Morello team) and in the RISC-V community, with the CHERI team jointly led by Robert N. M. Watson, Simon W. Moore, Peter Sewell, Peter G. Neumann, and Ian Stark.

Fig. 1. Sail models and infrastructure (grayed-out models are partial ISA models in an older version of Sail)

This article is licensed under a Creative Commons Attribution 4.0 International License


# The FMCAD 2021 Student Forum

Mark Santolucito *Barnard College, Columbia University* New York City, USA msantolu@barnard.edu

*Abstract*—The Student Forum at the International Conference on Formal Methods in Computer-Aided Design (FMCAD) gives undergraduate and graduate students the opportunity to engage with the Formal Methods community by presenting their work and receiving feedback. The Student Forum was held in a hybrid format, with some students participating in limited in-person events in New Haven, Connecticut, USA.

The Graduate Student Forum was first introduced in 2013 to the FMCAD conference series. The goal of the Forum is to enable graduate students to attend the conference, even if they do not have a paper accepted at the main conference track. Students were attracted with an opportunity to present their on-going work to a broader scientific audience and receive valuable feedback about the research they are currently pursuing.

FMCAD 2021 hosted the ninth edition of the Student Forum. There was an open call for papers from both undergraduate and graduate students working broadly in the area of Formal Methods. In the call, students were asked to submit a two-page summary of their current research and on-going work. We received a number of high-quality submissions to the Student Forum and accepted a total of 10 submissions. Reviews were based on the overall quality and novelty of the work, its potential impact on the field of Formal Methods, and the potential benefit to the student of participating in the forum.

This year, the Student Forum allowed for the submission of joint research where two student researchers collaborated and contributed equally in the eyes of their advisors. The topics covered by the accepted submissions ranged across the field of Formal Methods, including foundational advancements as well as a variety of application domains. The accepted submissions are listed below with their respective student authors:


This edition of the FMCAD Student Forum follows a series of previous successful iterations of the forum [1]–[8].

We would like to thank the organizers of FMCAD, as well as the entire program committee of FMCAD, who have made the FMCAD student forum possible. Additionally, we are grateful to the student authors and their research mentors who have contributed their excellent work to the program.

#### REFERENCES


# COCOALMA: A Versatile Masking Verifier

Vedad Hadžić *Graz University of Technology*

Roderick Bloem *Graz University of Technology*

*Abstract*—Masking techniques are an effective countermeasure against power side-channel attacks. Unfortunately, correctly masking a hardware circuit is difficult, and mistakes may lead to functionally correct circuits with insufficient protection. We present COCOALMA, a tool that formally verifies the side-channel resistance of stateful hardware circuits. Although COCOALMA was initially used to verify programs running on CPUs, we extended it to verify the security of several industrial masked hardware implementations. We give an overview of the tool's structure, implementation details, optimizations that make it faster and more scalable than its predecessor REBECCA, and changes that enable verifying the probing security of any stateful hardware circuit. Finally, we evaluate COCOALMA with masked implementations of the PRINCE and AES ciphers.

*Index Terms*—Side-channels, Hardware masking, Formal verification

#### I. INTRODUCTION

Integrated circuits that process sensitive data are susceptible to passive *side-channel attacks* like differential *power analysis*. Naturally, attackers are interested in the secret keys of symmetric ciphers because learning them would break the confidentiality of the processed data [22], [23], [26], [21]. Classical power analysis attacks exploit the correlation of the circuit's power consumption to bits of the secret key. Ultimately, the key is reconstructed using statistical analysis techniques in a series of key guesses [22], [27].

*Masking* is an algorithmic countermeasure against power analysis attacks. It relies on splitting all secrets and intermediate computations into multiple signals, called shares. The circuit is rewritten so that attackers can only reconstruct the original value if they can observe all the shares simultaneously. Masking techniques achieve this by introducing randomness into the circuit and destroying the correlation between the power trace and the original data. Several masking schemes describe how to make circuits secure against side-channel attacks. Among them, *domain-oriented masking* [15] and *threshold implementations* [9] are well studied and widely adopted. The security of masked hardware circuits is expressed using the *hardware probing model* [2], [18], [4], where an attacker can read the values of d wires. Traditionally, engineers validate masked hardware implementations empirically by creating power traces and computing the correlations over many executions. Recently, however, we see several formal masking verification methods that can substantially reduce the costs of validating power side-channel resistance of software and hardware [2], [1], [11].

This work was supported by the *Austrian Research Promotion Agency* (FFG) through the FERMION project (grant number 867542).

Figure 1. The workflow of COCOALMA showing the *parsing*, *tracing*, and *verification* phases, as well as their artifacts. At the end of the verification phase, COCOALMA either acknowledges that the analyzed design is secure or shows that a secret is leaked at a given location in the circuit.

COCOALMA is an open-source masking verifier<sup>1</sup> that assisted the hardening of a RISC-V processor<sup>2</sup> so it could safely execute masked software [13]. It considers the exact description of the hardware that runs the software and accounts for hardware leakage effects such as glitches. Figure 1 shows the workflow of COCOALMA. Starting with a hardware design written in Verilog, COCOALMA uses Yosys [31] to synthesize a flat gate-level Verilog netlist. Additionally, the parsing phase extracts a circuit graph of the synthesized design and creates a labeling template where the user can specify the contents of each register and input port of the circuit after the reset.

<sup>1</sup>https://github.com/IAIK/coco-alma

<sup>2</sup>https://github.com/IAIK/coco-ibex

COCOALMA uses a testbench provided by the user to simulate the netlist with Verilator [28], resulting in a *value change dump* showing how the internal signals changed throughout the execution. For the analysis of software running on RISC-V processors, COCOALMA additionally requires the RISC-V toolchain to compile programs and add them to the testbench before starting the simulation. The resulting execution trace is used to determine the value and glitching properties of each wire in the design. Afterward, the time-constrained probing model, initial state, simulation trace, and glitching information are encoded as a SAT problem and solved with CaDiCaL [3]. If the problem is unsatisfiable, no possible observation would leak any of the secrets. Otherwise, COCOALMA gives a precise description of the leakage location, the secret bits that are leaked, and a variety of other debugging information.

Although COCOALMA was first used for analyzing software running on CPUs [13], its roots in the older verification tool REBECCA [4] can be leveraged towards stateful hardware verification of masked cipher implementations. Luckily, all the principles used in COCOALMA also apply to hardware masking verification with minor tweaks. In this paper, we document the inner workings of COCOALMA, its features, and show the extensions necessary for applying it to cryptographic accelerator modules. We present the following details about COCOALMA's implementation:


#### II. SECURITY MODELS

Masked implementations split all intermediate data signals $x$ into $d+1$ uniformly random pieces $x_i$, with $x = x_0 \oplus \dots \oplus x_d$. In practice, for $i \neq d$, the signal shares $x_i$ are sampled from a random number generator, whereas $x_d$ is chosen as $x \oplus x_0 \oplus \dots \oplus x_{d-1}$ to fit the equality. This countermeasure tries to prevent an attacker, who can observe intermediate computations through side-channels, from learning anything about the processed data. When investigating whether a masked implementation is actually side-channel resistant, several security models describe the capabilities of an attacker and the real-world effects they can observe. COCOALMA implements three different probing models that consider different attacker capabilities and system behavior. More specifically, this work extends COCOALMA to support continuous probing as part of the *hardware probing model*.
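The sharing scheme described above can be sketched in a few lines of Python. This is only an illustration of Boolean masking, not code from COCOALMA; the function names `mask` and `unmask` and the `bits` parameter are assumptions made for this example.

```python
import secrets
from functools import reduce
from operator import xor


def mask(x: int, d: int, bits: int = 8) -> list:
    """Split x into d+1 Boolean shares (hypothetical helper).

    The first d shares are sampled uniformly at random; the last share
    is x XOR-ed with all of them, so that the XOR of all shares is x.
    """
    random_shares = [secrets.randbits(bits) for _ in range(d)]
    last_share = reduce(xor, random_shares, x)
    return random_shares + [last_share]


def unmask(shares: list) -> int:
    """Recombine the shares by XOR-ing them together."""
    return reduce(xor, shares, 0)


shares = mask(0xA5, d=2)
recovered = unmask(shares)
```

Any proper subset of the shares is uniformly distributed and independent of `x`, which is why an attacker limited to d probes should learn nothing about a correctly masked value.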

Software probing model. The original probing model defined by Ishai et al. [18] considers the stable state of computations, ignoring hardware side-effects such as glitches and transitions. Their seminal paper says that an attacker in this probing model can choose d intermediate values that they can observe. The attacker can then interactively query the execution of the system several times with different inputs and starting states. The inputs of the computation are declared either (a) *public*, which means that learning them does not benefit the attacker, (b) fixed uniformly random values called *masks*, or (c) parts of a secret called *shares*. The attacker's goal is to learn all the shares of a secret and use them to reconstruct the secret value they are not supposed to know. Proving that an implementation is d-probing secure requires showing that no attacker adhering to this probing model can learn the secrets, irrespective of their strategy.

Time-constrained probing model.<sup>3</sup> When COCOALMA was first presented [13], its primary goal was verifying the masking of software programs running on an accurate description of the underlying hardware. Naturally, this required an adequate probing model that translates software probing into the hardware domain. The *time-constrained probing model* uses the gate-level description of the hardware and an execution trace generated by simulating the hardware running the software, instead of a purely algorithmic description. The goals of the attacker are the same as in the *software probing model*. However, this model is more realistic, as the attacker can probe d observation tuples (g, t), where g is a logic gate or register and t is a cycle in the execution trace. This gives an attacker access to all the intermediate values of gate g in cycle t, including all the values caused by hardware effects such as glitches and register transition leakage. The two parameters g and t are not coupled, meaning that the attacker can also probe the same gate in multiple clock cycles or even probe d different gates in the same clock cycle. Although this model limits each probe to observing only one clock cycle, instead of running throughout the computation, its inclusion of hardware effects significantly enhances the capabilities of an attacker.

<sup>3</sup>Barthe et al. [2] and Moos et al. [24] call this the *robust probing model*.

Due to the different signal timings in hardware, an attacker observing gate g = a ⊙ b in this model would also observe the signals a and b in addition to g. Registers are synchronous elements triggered by a clock, making them the only hardware elements exempt from this phenomenon. Another effect that increases the attacker's capabilities is transition leakage, which causes the power consumption to correlate with the linear combination $g^{t-1} \oplus g^{t}$ of the old signal value in cycle t − 1 and the new signal value in cycle t. Transition leakage applies to all hardware elements equally, including registers.
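A small Python sketch shows why transition leakage matters. The scenario is an assumption chosen for illustration and does not come from the evaluated designs: a register holds one share of a first-order masked secret in cycle t − 1 and is overwritten with the other share in cycle t. The helper `transition_leak` is hypothetical and simply models the linear combination of old and new value.

```python
import secrets


def transition_leak(old_value: int, new_value: int) -> int:
    """Model of the value a transition correlates with: old XOR new."""
    return old_value ^ new_value


secret = 0b1011

# First-order Boolean masking: two shares whose XOR is the secret.
share_0 = secrets.randbits(4)
share_1 = secret ^ share_0

# Cycle t-1: the register holds share_0. Cycle t: it is overwritten with share_1.
leak = transition_leak(share_0, share_1)
```

Each share on its own is uniformly random, yet `leak` always equals `secret`, so a single probe on this register would reveal the unmasked value under transition leakage.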

Hardware probing model. This paper extends the tool COCOALMA with a model where probes are not bound to one clock cycle like in the *time-constrained probing model*. The attacker's goals remain the same as before, only that in this more rigorous model, the probes record continuously throughout the whole computation. More precisely, instead of choosing a clock cycle for each observed location, the attacker observes all values, including those caused by glitches and transitions, that pass through a wire. In a sense, this is a more powerful rephrasing of the original probing model of Ishai et al. [18], as they also did not limit the duration of the probes for stateful circuits. As this model significantly increases the capabilities of an attacker, hardware designers employ random number generators to create fresh uniformly random masks in each clock cycle, intending to break any correlations that might otherwise be observed. These mask-generating circuits are usually not part of the masked hardware designs and are only used as black-boxes that provide random inputs to the masked circuit. We incorporate this in COCOALMA, allowing designers to label input ports of a circuit as *random*. The values read from these ports behave similarly to fixed *masks*, only that they represent a new mask in each clock cycle, which is then considered during verification. The semantics of *public* and *share* signals remains the same, and we even allow fixed *masks*, just like in the other probing models.

#### III. VERIFICATION METHOD

COCOALMA tries to verify the side-channel resistance of a masked implementation in one of the given security models. A correctly masked implementation computes the values of arbitrary logic functions without exposing the value of the secret to an attacker through intermediate computations. Therefore, a masked implementation must ensure that intermediate signals do not correlate with *secrets*; that is, the value of an intermediate signal should be statistically independent of all secrets. COCOALMA checks whether these properties hold by tracking the correlations of each logic operation throughout the computation [4], [13]. For instance, if a circuit were to compute the expression f = a∧b, then f correlates positively with a, b, and the constant ⊥ because they have the same value in three out of four cases. For the same reason, f correlates negatively with the linear combination a⊕b because they only have the same value in one of four cases, *i.e.*, when both a and b are ⊥. An exact algorithm that computes these correlations would solve the #SAT problem [14], meaning that computing correlations is #P-hard [29] and thus at least as hard as any problem in NP. Because of the structure of secrets and the uniform randomness of secret shares and masks, it is sufficient to track the correlations to linear combinations of the inputs [4]. Furthermore, the correlations yield a sound over-approximation that reduces the complexity of the problem and is also used in COCOALMA. In the following sections, we describe this over-approximation and its implementation, but refer to the soundness proofs in the original publication [4].

Table I. Propagation rules for stable and transient correlation sets.

#### *A. Correlation Sets*

Instead of painstakingly computing the exact correlation factor for each linear combination of inputs, COCOALMA over-approximates the correlations. In particular, COCOALMA only considers whether the correlation factor is non-zero, and ignores its exact value. All linear combinations a gate correlates to are grouped together and tracked as so-called *correlation sets*. The exact correlations are approximated using propagation rules that determine the correlation set of f = a⊙b by considering the correlation sets of a and b, as well as the used logic operation ⊙. Using the previous example f = a ∧ b, we have shown that the correlation set contains all linear combinations of a and b, *i.e.*, {⊥, a, b, a ⊕ b}. In contrast, f = a ⊕ b only correlates with itself, *i.e.*, the set {a ⊕ b}, because the value of a ⊕ b coincides with ⊥, a, and b in exactly half of the cases, yielding a correlation factor of zero. Consequently, knowing f would not reveal any information about a and b. In general, we cannot compute the correlation set of the output of a logical operation precisely from the correlation sets of its inputs, so COCOALMA over-approximates these sets.
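The two examples above can be checked by brute force. The following Python sketch enumerates the truth table of a small gate and collects every linear combination of inputs whose agreement with the output differs from one half. It is an exact, exponential-time illustration and not the over-approximation COCOALMA uses; the function name `correlation_set` and the encoding of a linear combination as a `frozenset` of input names (with the empty set standing for ⊥) are assumptions made for this example.

```python
from itertools import combinations, product


def correlation_set(gate, names):
    """Return all linear combinations of inputs with non-zero correlation to gate.

    A linear combination is encoded as a frozenset of input names;
    the empty frozenset encodes the constant ⊥ (false).
    """
    n = len(names)
    assignments = list(product((0, 1), repeat=n))
    result = set()
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            # Count assignments where the gate output equals the XOR of the subset.
            agree = sum(
                gate(*bits) == sum(bits[i] for i in subset) % 2
                for bits in assignments
            )
            # Agreement in exactly half of the cases means zero correlation.
            if 2 * agree != len(assignments):
                result.add(frozenset(names[i] for i in subset))
    return result


and_set = correlation_set(lambda a, b: a & b, ("a", "b"))
xor_set = correlation_set(lambda a, b: a ^ b, ("a", "b"))
```

Here `and_set` contains the four combinations ⊥, a, b, and a ⊕ b, while `xor_set` contains only a ⊕ b, matching the sets given in the text.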

Table I presents the propagation rules COCOALMA uses to compute the correlation sets of a gate using its inputs. The propagation rules define two kinds of correlation sets necessary for the verification: (a) *stable* sets $S^t_f$ that define the normal behavior of a gate f, and (b) *transient* sets $T^t_f$ that define the behavior of f in the presence of glitches and transition leakage effects. Both types of correlation sets are defined for each clock cycle t, as gates change their value over time. Although the hardware probing model only talks about these transient correlation sets, the stable correlation sets are necessary for synchronizing elements such as registers. For simpler exposition and encoding, Table I shows the computation of correlation sets using the operators ⊗ and ⟨·⟩. Here, ⊗ is the element-wise exclusive-or between two correlation sets, *i.e.*, X ⊗ Y = {x ⊕ y | x ∈ X, y ∈ Y }. The operator ⟨·⟩ adds a correlation with ⊥ to a correlation set, *i.e.*, ⟨X⟩ = X ∪ {⊥}.

The presented propagation rules are based on the original publications [13], [4] but were adapted for stateful hardware verification with continuously recording probes. Naturally, constants only correlate to ⊥, and negations only change the sign of the correlation but do not impact the correlations themselves. As discussed previously, linear gates only correlate to the linear combination of the inputs, so the correlation set is computed as the element-wise exclusive-or of the inputs' correlation sets. For non-linear gates, the correlation set is computed similarly, only that in this case, a bias is introduced in each input's correlation set. Using the introduced notation, the correlation set of gate f = a ∧ b, where a and b are inputs, is computed as

$$
\langle \{a\} \rangle \otimes \langle \{b\} \rangle = \{\bot, a\} \otimes \{\bot, b\} = \{\bot, a, b, a \oplus b\} \text{ .}\tag{1}
$$
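The two operators can be mirrored with explicit sets for small examples. This is a sketch under the assumption that a correlation term is a `frozenset` of port names, with the empty set standing for ⊥; the function names `otimes` and `bias` are chosen here for illustration only.

```python
BOT = frozenset()  # the constant term ⊥

def otimes(X, Y):
    """Element-wise exclusive-or X ⊗ Y; XOR of terms is symmetric difference."""
    return {x ^ y for x in X for y in Y}

def bias(X):
    """The operator ⟨X⟩ = X ∪ {⊥}."""
    return set(X) | {BOT}

a = {frozenset({"a"})}
b = {frozenset({"b"})}

stable_and = otimes(bias(a), bias(b))  # non-linear gate, as in Equation (1)
stable_xor = otimes(a, b)              # linear gate
```

Here `stable_and` evaluates to the four terms ⊥, a, b, and a ⊕ b, while `stable_xor` contains only a ⊕ b. Explicit sets grow exponentially, which is why they are only useful for toy inputs.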

For transient correlations, linear gates behave like non-linear gates. Glitches induced by different signal timings can force a gate to forward a constant or either of the inputs, in addition to the correct correlations. A multiplexer correlates to both of its data inputs a and b, as well as their linear combinations with the selector c, *i.e.*, a⊕c and b⊕c. For the transient correlation set, COCOALMA assumes that all three input signals can be combined non-linearly.

When verifying masked software running on a processor, the input pins of the hardware design are not relevant, as they are part of the micro-architecture and not visible to the programmer. Secret shares, masks, and public values are all stored in both the RAM and the ROM, and for the verification process, we label their locations and simulate the design to execute a program [13]. Verifying masked hardware is different, as there are no such memory blocks, and the registers get cleared with a reset signal. Computation-relevant data, such as plaintexts, keys, and masks, is provided by the environment through the input ports of the circuit. Therefore, we extend COCOALMA with support for input ports and introduce an appropriate propagation rule, which states that an input port only correlates to its value in cycle t. In our implementation, *public* values, *shares*, and *masks* have the same value throughout the execution of the circuit. However, input ports labeled as *random* are provided by an external *random number generator* and change their value in each cycle, and therefore, the correlation set also changes each cycle. In addition to the support for input ports, we also optimized the propagation rules for registers. Since the probes in the *hardware probing model* record data continuously, we do not need to account for transition leakage because all values passing through a wire are recorded anyway.

Computing correlation sets from other correlation sets can result in over-approximations that include non-existent correlations. For example, representing the exclusive-or function f = a ⊕ b as f = (a ∧ ¬b) ∨ (¬a ∧ b) would result in the spurious correlation set {⊥, a, b, a ⊕ b}, when in reality f only correlates with {a ⊕ b}. This means that a hardware designer applying this over-approximative method must be aware of false leakage reports and debug them properly. Oftentimes, as illustrated in this toy example, the over-approximative error can be fixed by either re-writing the circuit or removing the problematic correlation term from the correlation set.

However, despite being imprecise, this over-approximation is easy to encode and retains some useful information. For example, function f = (a ⊕ b) ∧ c is correctly claimed to correlate with {⊥, c, a ⊕ b, a ⊕ b ⊕ c}, even though the correlation set of f was computed using the correlation sets of g = a ⊕ b and c. This result reflects the intuition that we cannot "remove" masking from a signal by combining it with another value, *i.e.*, the correlation set does not contain values where a appears without b.
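Both examples can be replayed with explicit sets. The sketch below assumes only the stable rules described in the text (negation leaves the set unchanged, linear gates use ⊗, non-linear gates use ⟨·⟩ on both inputs); the helper names are hypothetical, and terms are `frozenset`s of port names with the empty set standing for ⊥.

```python
BOT = frozenset()

def otimes(X, Y):
    return {x ^ y for x in X for y in Y}

def bias(X):
    return set(X) | {BOT}

def linear(A, B):
    """Stable rule for XOR-like gates."""
    return otimes(A, B)

def nonlinear(A, B):
    """Stable rule for AND/OR-like gates."""
    return otimes(bias(A), bias(B))

a, b, c = ({frozenset({p})} for p in "abc")

# f = (a ∧ ¬b) ∨ (¬a ∧ b); negations do not change correlation sets.
left = nonlinear(a, b)
right = nonlinear(a, b)
xor_via_and_or = nonlinear(left, right)   # spurious terms appear
xor_direct = linear(a, b)                 # precise result

# f = (a ⊕ b) ∧ c keeps a and b together in every term.
masked_and = nonlinear(linear(a, b), c)
```

The rewritten XOR yields all four terms over a and b, whereas the direct rule yields only a ⊕ b. In `masked_and`, no term contains a without b.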

#### *B. SAT Encoding*

The upper bound for the size of the correlation sets is exponential in the number of inputs, so COCOALMA cannot store or enumerate them explicitly and instead relies on an implicit encoding method that utilizes a SAT solver. While the used encoding is similar to the one presented by Bloem et al. [4], it was significantly optimized and streamlined in COCOALMA to simplify the implementation of all the propagation rules in Table I. As mentioned previously, the user needs to label each input port p ∈ I as either a *share* s ∈ K<sup>i</sup> of the i-th secret, a fixed random *mask* m ∈ M, a *random port* with a new value r ∈ R<sup>t</sup> in each clock cycle t, or a public value that is ignored. For simpler notation, we do not implicitly associate correlation sets or propositional variables with clock cycles or gates in the circuit, and instead specify them with C<sub>−</sub> and P<sub>−</sub>, where the subscript is used to differentiate them. In our SAT encoding, a correlation set C<sub>x</sub> is represented by a set of propositional variables P<sub>x</sub> = {x<sub>p</sub> | p ∈ I}, such that every valid assignment to the propositional variables P<sub>x</sub> corresponds to an element in the correlation set C<sub>x</sub>. Additionally, just like I, P<sub>x</sub> can be further split as P<sub>x</sub> = ⋃<sub>i</sub> K<sup>i</sup><sub>x</sub> ∪ M<sub>x</sub> ∪ ⋃<sub>t</sub> R<sup>t</sup><sub>x</sub>. Example 1 gives an intuition of the introduced variable sets and correlation set encoding.

*Example 1:* Let I = {s<sub>0</sub>, s<sub>1</sub>, m} be the labeled input ports given by the user, where s = s<sub>0</sub> ⊕ s<sub>1</sub> is a secret with shares K<sup>0</sup> = {s<sub>0</sub>, s<sub>1</sub>}, and fixed uniformly random masks M = {m}. Let C<sub>x</sub> = {⊥, s<sub>1</sub>, s<sub>0</sub> ⊕ m, s<sub>0</sub> ⊕ s<sub>1</sub> ⊕ m} be a correlation set. Then P<sub>x</sub> = {x<sub>s<sub>0</sub></sub>, x<sub>s<sub>1</sub></sub>, x<sub>m</sub>} are the propositional variables used for encoding C<sub>x</sub>, where K<sup>0</sup><sub>x</sub> = {x<sub>s<sub>0</sub></sub>, x<sub>s<sub>1</sub></sub>}, and M<sub>x</sub> = {x<sub>m</sub>}, and there are no random ports. The propositional variables in P<sub>x</sub> are constrained in such a way that the only satisfying assignments for the propositional tuple (x<sub>s<sub>0</sub></sub>, x<sub>s<sub>1</sub></sub>, x<sub>m</sub>) are (⊥, ⊥, ⊥), (⊥, ⊤, ⊥), (⊤, ⊥, ⊤), and (⊤, ⊤, ⊤). These assignments represent the elements of C<sub>x</sub>, where x<sub>p</sub> indicates whether the port p appears in the current term of C<sub>x</sub>.
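The correspondence between correlation terms and assignments in Example 1 can be written down directly. This is an illustrative sketch; the port order `("s0", "s1", "m")` and the helper names are assumptions made here, not part of the tool.

```python
PORTS = ("s0", "s1", "m")  # fixed variable order (x_s0, x_s1, x_m)

def term_to_assignment(term):
    """Map a correlation term to the truth values of (x_s0, x_s1, x_m)."""
    return tuple(p in term for p in PORTS)

def assignment_to_term(assignment):
    """Inverse mapping: collect the ports whose variable is set to true."""
    return frozenset(p for p, bit in zip(PORTS, assignment) if bit)

# C_x = {⊥, s1, s0 ⊕ m, s0 ⊕ s1 ⊕ m}
C_x = {
    frozenset(),
    frozenset({"s1"}),
    frozenset({"s0", "m"}),
    frozenset({"s0", "s1", "m"}),
}

models = sorted(term_to_assignment(t) for t in C_x)
```

The resulting `models` list contains exactly the four assignments listed in Example 1.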

COCOALMA maps the correlation terms in C<sub>x</sub> to satisfying assignments to the propositional variables P<sub>x</sub> by translating the propagation rules from Table I into satisfiability constraints. However, in order to simplify the exposition, we only demonstrate how we encode the correlation set operations ⟨·⟩, ∪, and ⊗, as well as the creation of a correlation set with only one element. All of the propagation rules from Table I can be obtained by applying different combinations of these individual encodings, e.g., the transient rule for linear gates is obtained by combining the encodings of ⟨·⟩ and ⊗.

First off, the correlation set of an input port only contains the port itself. Therefore, we restrict all of its propositional variables that correspond to other ports to be ⊥, whereas the propositional variable representing the port itself must be set to ⊤. More precisely, for a port p in clock cycle t, the propositional variables P<sub>x</sub> are constrained with

$$x_{p^t} \land \bigwedge_{x_a \in \mathcal{P}_x,\, a \neq p^t} \neg x_a \,, \tag{2}$$

where only *random* input ports are different in each clock cycle and p = p<sup>t</sup> in all other cases.

Extending a correlation set C<sub>x</sub> with the ⊥ element, written as ⟨C<sub>x</sub>⟩, is required for the propagation rules of linear and non-linear operations. When translating this into constraints for propositional variables P<sub>x</sub>, COCOALMA introduces a new set of variables P′<sub>x</sub> and a fresh propositional variable q. The SAT solver can pick the value of q freely. Depending on the choice, all propositional variables P′<sub>x</sub> are forced to equal their corresponding variables in P<sub>x</sub> or forced to be ⊥. We write this constraint as

$$\bigwedge_{x_a \in \mathcal{P}_x,\, x_a' \in \mathcal{P}_x'} x_a' \leftrightarrow (q \land x_a) \,. \tag{3}$$

All satisfying assignments of P′<sub>x</sub> correspond to elements of the correlation set ⟨C<sub>x</sub>⟩. Each time the propagation rules in Table I use the ⟨·⟩ operator, we introduce the variables P′<sub>x</sub> and q and apply the given constraint.

Encoding the propagation rule for multiplexers requires a similar constraint when representing the union of two correlation sets. Given the correlation set C<sub>z</sub> = C<sub>x</sub> ∪ C<sub>y</sub>, we introduce corresponding propositional variables P<sub>z</sub> and a fresh propositional variable q. We subsequently constrain the introduced propositional variables with

$$\bigwedge_{z_a \in \mathcal{P}_z,\, x_a \in \mathcal{P}_x,\, y_a \in \mathcal{P}_y} z_a \leftrightarrow \left( (q \wedge x_a) \vee (\neg q \wedge y_a) \right), \tag{4}$$

where whenever q = ⊤ an element of C<sub>x</sub> is encoded, and otherwise an element of C<sub>y</sub>. This encoding ensures that C<sub>z</sub> contains all elements of C<sub>x</sub> and C<sub>y</sub>, even if they are duplicates.

Finally, COCOALMA encodes the element-wise exclusive-or of two correlation sets C<sub>z</sub> = C<sub>x</sub> ⊗ C<sub>y</sub> using their corresponding propositional variables and a straightforward equivalence encoding

$$\bigwedge_{z_a \in \mathcal{P}_z,\, x_a \in \mathcal{P}_x,\, y_a \in \mathcal{P}_y} z_a \leftrightarrow (x_a \oplus y_a) \;. \tag{5}$$

Unlike the encoding of unions, no additional fresh propositional variables are necessary as there is no choice involved.
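To see what constraints (3)–(5) admit, one can enumerate their models by brute force instead of calling a SAT solver. The sketch below is only an illustration under that substitution: a correlation set is given as an explicit set of assignments (tuples of Booleans over a fixed port order), and because each output variable is functionally determined by the inputs and q, enumerating the free choices is enough. All function names are hypothetical.

```python
def models_bias(Cx):
    """Models of constraint (3): x'_a <-> (q and x_a), with q chosen freely."""
    return {tuple(q and xa for xa in x) for x in Cx for q in (False, True)}

def models_union(Cx, Cy):
    """Models of constraint (4): z_a <-> ((q and x_a) or (not q and y_a))."""
    return {
        tuple((q and xa) or ((not q) and ya) for xa, ya in zip(x, y))
        for x in Cx for y in Cy for q in (False, True)
    }

def models_otimes(Cx, Cy):
    """Models of constraint (5): z_a <-> (x_a xor y_a); no fresh variable."""
    return {tuple(xa != ya for xa, ya in zip(x, y)) for x in Cx for y in Cy}

# Port order (a, b, c): C_x = {a}, C_y = {b}
Cx = {(True, False, False)}
Cy = {(False, True, False)}

biased = models_bias(Cx)         # {⊥, a}
union = models_union(Cx, Cy)     # {a, b}
product = models_otimes(Cx, Cy)  # {a ⊕ b}
```

In a real encoding, the sets `Cx` and `Cy` would themselves be described by constraints rather than listed explicitly.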

The constraints (2)-(5) only show how each of the propagation rules shown in Table I can be translated into SAT.

COCOALMA needs an additional encoding for the conditions under which information leakage occurs. With correlation sets, we check whether there is an element of the correlation set where all shares of a secret are present, without being hidden by uniformly random values, such as fixed masks, random input ports, or shares of other secrets. Looking back at Example 1, we see that each time both shares s<sub>0</sub> and s<sub>1</sub> appear in a correlation term, they are masked by mask m. This means that the correlation set does not leak information about s = s<sub>0</sub> ⊕ s<sub>1</sub>. When checking this leakage property using the SAT encoding, we require two constraints.

First, we enforce that for each secret, either all shares are active, or all shares are inactive. Furthermore, we say that at least one secret must be active in order to have a leak. We encode this property by introducing one fresh propositional variable k<sub>i</sub> for each secret and constraining them with

$$\left(\bigvee_{i} k_{i}\right) \land \bigwedge_{i} \bigwedge_{x_{s} \in \mathcal{K}_{x}^{i}} k_{i} \leftrightarrow x_{s} \,. \tag{6}$$

The first conjunct guarantees that at least one of the secrets is present in the correlation term. The rest of the expression ensures that either all shares of a secret are active in a correlation term, or none of them are, which is necessary since shares of incomplete secrets are uniformly random.

Second, we enforce that no masks appear in the correlation term, so the secrets are not *hidden* by uniformly random values, as discussed in Example 1. We represent this in the SAT encoding as

$$\left(\bigwedge_{x_m \in \mathcal{M}_x} \neg x_m\right) \wedge \left(\bigwedge_t \bigwedge_{x_r \in \mathcal{R}_x^t} \neg x_r\right),\tag{7}$$

which ensures that a satisfying solution must assign all the variables representing masks and random values with ⊥.
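The leakage condition expressed by constraints (6) and (7) can be checked on an explicit correlation set as follows. This is a sketch for intuition only: the function names and the dictionary-based labeling are assumptions introduced here, and the real check is performed by the SAT solver on the implicit encoding.

```python
def term_leaks(term, secrets, masks, randoms):
    """Check one correlation term against the conditions of (6) and (7)."""
    # (7): no fixed mask and no random port may appear in the term.
    if term & (masks | randoms):
        return False
    # (6): for each secret, all shares are present or none are ...
    active = []
    for shares in secrets.values():
        present = term & shares
        if present and present != shares:
            return False
        active.append(bool(present))
    # ... and at least one secret is fully present.
    return any(active)

def set_leaks(C, secrets, masks, randoms):
    return any(term_leaks(t, secrets, masks, randoms) for t in C)

secrets = {"s": frozenset({"s0", "s1"})}
masks = frozenset({"m"})
randoms = frozenset()

# Example 1: every term with both shares also contains the mask m.
C_x = {frozenset(), frozenset({"s1"}), frozenset({"s0", "m"}),
       frozenset({"s0", "s1", "m"})}
example1_leaks = set_leaks(C_x, secrets, masks, randoms)

# An unmasked recombination of both shares is a leak.
unmasked_leaks = set_leaks({frozenset({"s0", "s1"})}, secrets, masks, randoms)
```

For Example 1 the check reports no leak, while the unmasked term s<sub>0</sub> ⊕ s<sub>1</sub> is flagged.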

Constraints (6) and (7) go hand in hand, and both are required when testing whether a given correlation set leaks information about the secrets. When checking the security of a circuit in one of the supported security models, COCOALMA determines the observations an attacker can make, where each observation is made up of multiple correlation sets. For the *software probing model*, COCOALMA takes all the d-tuples O of probing locations (g, t) and tests the non-linear combination of their stable correlation sets

$$\bigotimes_{(g,t)\in\mathcal{O}} \left< S^t_g \right>\,,\tag{8}$$

where g is the chosen gate, and t is the chosen clock cycle. The same applies to the *time-constrained probing model*, where COCOALMA checks the transient correlation sets T<sup>t</sup><sub>g</sub> instead. In contrast, for the full *hardware probing model*, the probing locations O are a d-tuple of gates g instead, and concern all the clock cycles t for the given gates. Therefore, COCOALMA must check the correlation set

$$\bigotimes_{g \in \mathcal{O}} \bigotimes_{t} \left< T_{g}^{t} \right> \,, \tag{9}$$

which significantly increases the observations an attacker can make. For example, using a register to store one share of a secret early in the computation and store the other share later in the computation would still allow an attacker to reconstruct the secret. Naturally, longer executions of a circuit get progressively harder to verify.
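The register example can be replayed on explicit sets to contrast per-cycle probes, as in Equation (8), with probes that record all cycles, as in Equation (9). The sketch makes several simplifying assumptions that are not from the paper: there is a single secret with shares `s0` and `s1`, no masks, one hypothetical register named `reg`, and the same per-cycle sets are used for both models.

```python
from functools import reduce
from itertools import combinations

BOT = frozenset()
SHARES = frozenset({"s0", "s1"})

def otimes(X, Y):
    return {x ^ y for x in X for y in Y}

def bias(X):
    return set(X) | {BOT}

def combine(sets):
    """Non-linear combination: the ⊗-product of the biased sets."""
    return reduce(otimes, (bias(S) for S in sets), {BOT})

def leaks(C):
    # Simplified check: one secret, no masks or random ports.
    return SHARES in C

# Correlation sets of the register per clock cycle.
sets = {"reg": {0: {frozenset({"s0"})}, 1: {frozenset({"s1"})}}}

def per_cycle_leak(sets, d):
    """Probes are (gate, cycle) pairs, as in Equation (8)."""
    probes = [(g, t) for g in sets for t in sets[g]]
    return any(leaks(combine(sets[g][t] for g, t in obs))
               for obs in combinations(probes, d))

def all_cycles_leak(sets, d):
    """Probes are gates observed in every cycle, as in Equation (9)."""
    return any(leaks(combine(sets[g][t] for g in obs for t in sets[g]))
               for obs in combinations(sets, d))

first_order_per_cycle = per_cycle_leak(sets, 1)
first_order_all_cycles = all_cycles_leak(sets, 1)
```

With one probe, the per-cycle model sees only one share at a time, whereas a single continuously recording probe on the register observes both shares and reveals the secret.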

#### *C. Encoding Optimizations*

Although the shown SAT encoding is sufficient for showing whether the circuit leaks information about the processed secrets, the size of the produced constraints and formulas is unnecessarily large. In this section, we present some of the optimizations that dramatically reduce the effort of showing that a masked hardware circuit is secure.

Variable elimination. The sets of propositional variables P<sub>x</sub> often include variables constrained through unit clauses, so their assignment is predetermined and equal in all satisfying solutions. Constraint (2) is an example of such a situation. Building constraints for such variables is unnecessary, and they can be removed entirely, substantially reducing the size of the formula given to the SAT solver. In practice, COCOALMA implements this by storing P<sub>x</sub> as a dictionary of propositional variables, as well as a set of variables trivially set to ⊤. All variables from P<sub>x</sub> that are not present are known to have the value ⊥. Consequently, whenever creating any of the shown constraints (3)–(7), we first check for trivial simplifications using the properties of logic operators. Although this optimization might seem superficial, it single-handedly reduces the number of variables and clauses by anywhere between 90% and 98% for the probing verification problems we have investigated so far. Notably, this optimization does not reduce the complexity of the queries given to the SAT solver, as solvers usually detect unit clauses anyway, but instead significantly reduces the memory consumption. Without this optimization, verifying the probing security of longer executions would not be possible because the formula would not fit into memory.
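The idea of simplifying before emitting clauses can be sketched for constraint (5). The code below is an assumption-laden illustration, not the tool's data structure: it keeps constants and solver literals in one dictionary (a Boolean for a known value, a DIMACS-style signed integer for a literal, and an absent key for ⊥), whereas the text describes a dictionary plus a separate set of ⊤ variables. The class name `Encoder` is hypothetical.

```python
from itertools import count

class Encoder:
    def __init__(self):
        self._ids = count(1)
        self.clauses = []  # CNF clauses as lists of signed integers

    def fresh(self):
        return next(self._ids)

    def xor(self, x, y):
        """Return a constant or literal equal to x XOR y, emitting clauses
        only when neither operand is a known constant."""
        if isinstance(x, bool) and isinstance(y, bool):
            return x != y
        if isinstance(x, bool):
            return -y if x else y
        if isinstance(y, bool):
            return -x if y else x
        if x == y:
            return False
        if x == -y:
            return True
        z = self.fresh()
        # Clauses for z <-> (x xor y).
        self.clauses += [[-z, x, y], [-z, -x, -y], [z, -x, y], [z, x, -y]]
        return z

    def otimes(self, px, py):
        """Constraint (5) on sparse variable maps; absent ports are ⊥."""
        pz = {}
        for port in px.keys() | py.keys():
            z = self.xor(px.get(port, False), py.get(port, False))
            if z is not False:
                pz[port] = z
        return pz

enc = Encoder()
# Two input ports, each encoded by constraint (2): only constants involved.
pz = enc.otimes({"s0": True}, {"m": True})
```

Combining two input-port sets this way produces `{"s0": True, "m": True}` without creating a single variable or clause.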

Covering sets. Due to the nature of the propagation rules from Table I, some correlation sets are supersets of others. Take the propagation rules for non-linear gates as an example. For gate f = a ∧ b, the stable correlation set is computed as S<sup>t</sup><sub>f</sub> = ⟨S<sup>t</sup><sub>a</sub>⟩ ⊗ ⟨S<sup>t</sup><sub>b</sub>⟩ = {⊥} ∪ S<sup>t</sup><sub>a</sub> ∪ S<sup>t</sup><sub>b</sub> ∪ (S<sup>t</sup><sub>a</sub> ⊗ S<sup>t</sup><sub>b</sub>), which implies that S<sup>t</sup><sub>a</sub> ⊆ S<sup>t</sup><sub>f</sub> and S<sup>t</sup><sub>b</sub> ⊆ S<sup>t</sup><sub>f</sub>. Consequently, it is sufficient to perform the security checks for S<sup>t</sup><sub>f</sub>, ignoring both S<sup>t</sup><sub>a</sub> and S<sup>t</sup><sub>b</sub> because their elements are already *covered*. For element-wise exclusive-or operations like C<sub>z</sub> = C<sub>x</sub> ⊗ C<sub>y</sub>, the resulting set C<sub>z</sub> covers C<sub>x</sub> whenever ⊥ ∈ C<sub>y</sub>, and C<sub>y</sub> whenever ⊥ ∈ C<sub>x</sub>. It turns out that in the *software probing model*, we only need to check gates that are inputs to XOR gates, selectors of a multiplexer, inputs to a register, and circuit outputs. In the *time-constrained probing model*, we only check register inputs and circuit outputs because in that model linear gates behave non-linearly due to glitches. In the full *hardware probing model*, the covering properties are slightly more complex, and we check all gates that have at least one clock cycle where another gate does not cover them.

Table II SIMPLIFICATION RULES FOR STABLE CORRELATION SETS


### IV. SIMULATIONS

Although the method presented in Section III is sufficient to check the security of a masked implementation in the supported probing models, it does not consider how the control signals change over time. As mentioned in the introduction, COCOALMA uses simulations to obtain information about the exact values of control signals and subsequently uses them to simplify the correlation sets accordingly.

In the hardware probing model, all values marked as *sensitive*, *i.e.*, secret shares, mask registers and random input ports, are assumed to be uniformly random. This is a requirement for the execution environment, in this case the testbench, which performs the secret sharing steps and includes a random number generator that drives the random input ports in each clock cycle. In any reasonable probing model, the attacker can only control the values of un-shared plaintext values, and we assume they can request an unlimited number of encryptions for the DPA attack. If the attacker were able to tamper with the random number generator of the environment, they would be able to break any conceivable masking scheme, so this is out of scope in the hardware probing model.

Other input signals, such as control signals, which are marked as *public*, are assumed to be independent of the secrets and masks processed in the hardware circuit, so their values can be taken directly from a circuit simulation. Since their values are known, COCOALMA uses them to perform simplifications while applying the propagation rules. Consider the gate f = a ∧ b, where a is a public value and b has a correlation set C<sub>b</sub>. Because COCOALMA knows the value of a, f is simplified accordingly. If a = ⊥, then we know that f = ⊥ independently of b, meaning that f is also a public value and does not need a correlation set. Similarly, if a = ⊤, we know that f = b, and we can reuse the correlation set as C<sub>f</sub> = C<sub>b</sub>. Table II defines analogous simplifications for all propagation rules with multiple inputs when the constant signal is stable. Using the simulated execution of the circuit and the labeling provided by the user, each gate g at each clock cycle t is classified as either being a control signal or having a correlation set, but never both. Empty entries in Table II indicate that the gate does not have a correlation set and is instead declared a control signal.
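The AND-gate case described above can be sketched as a small function. The representation is an assumption made for illustration: a signal is either a Python `bool` (a public value known from simulation) or an explicit correlation set, mirroring the "control signal or correlation set, but never both" classification. Only the AND rule from the text is covered, not the remaining entries of Table II.

```python
BOT = frozenset()

def otimes(X, Y):
    return {x ^ y for x in X for y in Y}

def bias(X):
    return set(X) | {BOT}

def and_gate(a, b):
    """Stable rule for f = a ∧ b with simplification by public values.
    Each operand is a bool (public) or a correlation set."""
    a_public, b_public = isinstance(a, bool), isinstance(b, bool)
    if a_public and b_public:
        return a and b               # f is public as well
    if a_public:
        return b if a else False     # a = ⊤: f = b; a = ⊥: f = ⊥
    if b_public:
        return a if b else False
    return otimes(bias(a), bias(b))  # general non-linear rule

C_b = {frozenset({"s0", "m"})}
blocked = and_gate(False, C_b)    # public ⊥, no correlation set
forwarded = and_gate(True, C_b)   # C_f = C_b
```

A public ⊥ on one input turns the gate output into a public value, while a public ⊤ simply forwards the other input's correlation set.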

#### *A. Signal Stability*

Unlike with stable correlation sets, applying simplifications based on the simulation trace is not straightforward for transient correlation sets, where COCOALMA must also consider glitches. Glitches are hardware phenomena that behave like temporary faults while switching values. A gate f = a ⊙ b will pass on a's value if its signal arrives at f before the new signal of b. After both signals have arrived, the fault is corrected, and f becomes the value it is supposed to have. Ultimately, the signal must be stable at the end of a clock cycle, when the clock triggers the registers and synchronizes the computation.

Table III SIGNAL STABILITY COMPUTATIONS

However, there are certain conditions when a gate cannot experience a glitch, e.g., when the values a and b come directly out of a register and do not change from the previous clock cycle. In that particular case, even though the signal timings are different, the value transmitted through the wires did not change the entire time, and no glitching is possible. As a result, even the signal produced by f would be stable and glitch-free. This property recursively propagates throughout the whole circuit and allows us to determine which values can be used for the simplifications shown in Table II, even for transient correlation sets.

COCOALMA uses the concrete values of a simulation trace to determine the glitching behavior of public values such as control signals. Assume the same situation as before, with f = a∧b, where a is a public value and b might correlate with masks or shares, and thus, has a correlation set Cb. Knowing whether f can forward b is crucial, as it might lead to an information leak in a later part of the circuit. If a = ⊥ and its signal is stable, meaning it cannot produce glitches, then f is a public value with f = ⊥. Therefore, a being a stable public signal set to ⊥ effectively stops the propagation of a correlation set from b to f. In the rest of this section, we outline a recursive method for determining whether a signal is stable in a given clock cycle.

In the following exposition, we introduce three predicates that help define the algorithm computing the signal stability. We use the *st*(x) predicate to say that the signal x is stable. The predicate *cr*(x) is true whenever the signal x is associated with a transient correlation set. Finally, predicate *vl*(x) represents the value of signal x taken from the execution trace. All three predicates also have a version that applies to the previous clock cycle: *st*′(x), *cr*′(x), and *vl*′(x). The rules computing the stability of any given signal f are shown in Table III. All values of the predicates are computed directly, and none of them are given to the SAT solver.

First, all input ports are held stable by the environment. That is, another circuit that controls the input ports must keep their signals stable and avoid glitches. Since public signals and signals with correlation sets are mutually exclusive in COCOALMA, an input port is only considered stable when it does not have a correlation set. Similarly, the output of a register is stable if the register does not change its value from the previous cycle and does not have a correlation set associated with its input. If the value did change, we consider the signal unstable because it can cause glitches in gates connected to it during the clock-cycle transition. Linear gates such as XOR are only stable if both of their inputs are stable. If one of the inputs produces a glitch, then an XOR would forward it to all gates it is connected to since the other signal cannot stop it.

Table IV VERIFICATION RESULTS FOR TWO VERSIONS OF PRINCE-TI

Non-linear gates such as AND (OR) can remain stable even if one of their inputs produces glitches. If at least one of the inputs of an AND (OR) gate is stable at ⊥ (⊤), then no change or glitch in the other input can make it unstable. Otherwise, the output of an AND (OR) gate is only stable if both of its inputs are also stable. The conditions under which a multiplexer is stable are similar. For instance, if selector c is stable with the value ⊤ (⊥), then the output of the multiplexer is stable if and only if the selected input a (b) is stable. In contrast, if selector c is not stable, the output is only stable if the inputs a and b are stable and have equivalent values.
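The stability rules described in prose can be sketched as plain Boolean functions. Since the contents of Table III are not reproduced here, the code follows only the textual description; the `Sig` record and the function names are hypothetical, and signals with a correlation set are assumed to be modeled as `stable=False`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sig:
    value: bool   # vl(x): value from the simulation trace
    stable: bool  # st(x): signal cannot glitch in this cycle

def st_input(has_corr_set):
    return not has_corr_set

def st_register(prev_value, value, input_has_corr_set):
    return prev_value == value and not input_has_corr_set

def st_xor(a, b):
    return a.stable and b.stable

def st_and(a, b):
    # One input stable at ⊥ pins the output regardless of the other input.
    if (a.stable and not a.value) or (b.stable and not b.value):
        return True
    return a.stable and b.stable

def st_or(a, b):
    # Dual of AND: one input stable at ⊤ pins the output.
    if (a.stable and a.value) or (b.stable and b.value):
        return True
    return a.stable and b.stable

def st_mux(c, a, b):
    # Selector ⊤ picks a, selector ⊥ picks b, as in the text.
    if c.stable:
        return a.stable if c.value else b.stable
    return a.stable and b.stable and a.value == b.value

gated = st_and(Sig(False, True), Sig(True, False))
```

In the last line, the AND gate is stable because one input is stable at ⊥, even though the other input may glitch.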

### V. CASE STUDIES

In this section, we investigate the probing security of the masked hardware implementations PRINCE-TI [6] and AES-DOM [16]. In particular, we analyze the complexity of verifying round-reduced versions in all three of the supported probing models. Additionally, we demonstrate how COCOALMA's debugging functionalities allow us to identify potential issues and fix them accordingly. All experimental results shown in Table IV were captured on a notebook with the Intel Core i7-8550U 1.8GHz CPU and 16 GiB of RAM.

#### *A. Verifying PRINCE-TI*

PRINCE is a state-of-the-art lightweight block cipher. It is designed with hardware implementations in mind, so that ideally, the entire encryption process can be done in one clock cycle [5] when no masking is applied. PRINCE takes as input a 64-bit plaintext block and encrypts it with a 128-bit key. The encryption process consists of two phases with six rounds each. In the first phase, the first round adds the round key onto the data block, whereas the other five rounds apply a 4-bit S-Box, an affine transformation, and then mix the round key into the data block. After the first phase, the data block is transformed using the 4-bit S-Box, another affine transformation, and the inverse 4-bit S-Box, before starting the second phase. In the second phase, each round applies the inverse operations performed in the rounds of the first phase, meaning that the first five rounds add the round key and then apply the inverse affine transformation followed by the inverse 4-bit S-Box. The last round of the second phase only adds the round key to the data block.

Unlike the unmasked version of PRINCE, the threshold implementation PRINCE-TI [6] cannot be completed in one clock cycle. This restriction is due to the re-sharing phase present in threshold implementations, which requires additional synchronization to prevent leakage caused by glitches. For first-order probing security, the implementation splits all the plaintext and key bits into two shares and treats them as secrets. PRINCE-TI uses random inputs to re-share the outputs of its sixteen 4-bit S-Boxes, where each S-Box requires twelve random bits. In the official implementation, this process is optimized in such a way that four S-Boxes share the same randomness, so the re-sharing only requires a total of 48 random bits.

The first row of Table IV shows the results produced by COCOALMA, where 192 (*i.e.*, 128 key bits and 64 plaintext bits) pairs of ports are labeled as shares of secrets, and 48 ports are labeled as coming from a random number generator. The first round of the cipher needs three clock cycles to complete since we first need to load the inputs into internal registers and start the encryption. Within one second, COCOALMA has proven that the implementation is secure in the *software probing model* (SW), indicated with (✓) in Table IV. However, COCOALMA claims it found a leak (m) in the *time-constrained probing model* (TC) in the third clock cycle and provides us with debugging information.

#### *B. Debugging Information*

After finding a leak in a hardware circuit, COCOALMA attempts to simplify the leaking correlation. For example, COCOALMA could report that the output of a gate correlates with the linear combination of many secrets. This information, while correct, is often not useful for a designer because looking through the implementation and tracking the data dependencies of so many secret bits is extremely cumbersome. Therefore, COCOALMA attempts to minimize the number of secrets in the leaking correlation term. In particular, we go through all secret bits and greedily assume that the leaking correlation term does not contain them but still leaks information. If the SAT solver returns UNSAT, we know that the investigated secret must appear in the correlation term. At the end of this procedure, COCOALMA has produced a minimized example of a leaking correlation term.
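The greedy minimization can be sketched with an abstract oracle in place of the SAT solver. Everything named here is hypothetical: `still_leaks(excluded)` stands in for a solver call that asks whether a leaking term exists that avoids all secrets in `excluded`, and the toy oracle below answers it from an explicit list of leaking terms.

```python
def minimize_leak(secrets, still_leaks):
    """Greedily exclude secrets; those that cannot be excluded are required."""
    excluded, required = set(), set()
    for s in secrets:
        if still_leaks(excluded | {s}):
            excluded.add(s)   # SAT: a leak exists without s
        else:
            required.add(s)   # UNSAT: s must appear in the term
    return required

# Toy oracle: the leaking terms, given by the secrets they contain.
leaking_terms = [{"k1", "k2", "k3"}, {"k2"}]

def still_leaks(excluded):
    return any(not (term & excluded) for term in leaking_terms)

minimal = minimize_leak(["k1", "k2", "k3"], still_leaks)
```

For this toy input, the procedure narrows the report down to the single secret `k2`. The result of a greedy pass generally depends on the order in which secrets are tried.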

Next, COCOALMA provides a *leakage graph*, which allows the designer to visualize the structure of the leaking part of the circuit. In particular, the leakage graph highlights the leaking gates and only includes gates that influence the leak. We perform this graph minimization by starting at the leaking gates and computing their *cone of influence*.
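A plain structural cone of influence is a backward reachability search over the netlist. The sketch below assumes a netlist given as a dictionary from each gate to its input signals; the gate names are made up for illustration, and the tool may apply further filtering that is not modeled here.

```python
from collections import deque

def cone_of_influence(fanin, roots):
    """Return all signals from which some root is reachable."""
    seen = set(roots)
    todo = deque(roots)
    while todo:
        gate = todo.popleft()
        for src in fanin.get(gate, ()):
            if src not in seen:
                seen.add(src)
                todo.append(src)
    return seen

# Hypothetical netlist: gate -> list of its inputs.
fanin = {
    "mux": ["sel", "x", "y"],
    "x": ["r"],
    "y": ["r", "k"],
    "other": ["k"],
}

cone = cone_of_influence(fanin, ["mux"])
```

Gates such as `other`, which do not feed the leaking gate, are left out of the resulting graph.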

Figure 2. The PRINCE-TI leakage found with COCOALMA. Signal names are shown on top of lines, whereas the problematic correlation term or signal stability is shown below.

Finally, COCOALMA produces a *leakage trace* where the correlation terms of all relevant correlation sets are displayed. In particular, we take the model produced by the SAT solver and show the ports p ∈ I whose corresponding propositional variables in P<sub>x</sub> are assigned to ⊤, indicating they are part of the correlation term. The designer can combine this information with the leakage graph to deduce the cause of the leak.

#### *C. Debugging PRINCE-TI*

In the particular case of PRINCE-TI, we have identified the leak at multiplexer mux1\_out2[1], as shown in Figure 2. Here, the control signal sel1 determines whether the output is the inverse of the shift rows operation inv\_sr\_out2[1], or the compression operation comp\_sh2[1]. Here, a glitch on the control signal sel1 causes the multiplexer to forward both inputs in the third clock cycle. Unfortunately, inv\_sr\_out2[1] correlates to the uniformly random value r = i\_r[3] ⊕ i\_r[4] ⊕ i\_r[5], whereas comp\_sh2[1] correlates with r ⊕ i\_pt[1] ⊕ i\_key[1] ⊕ i\_key[65]. Observing these two values allows an attacker to compute i\_pt[1] ⊕ i\_key[1] ⊕ i\_key[65], breaking the security guarantees promised by masking schemes.

Although the leakage is observable at mux1\_out2[1], its root cause is somewhere else. Under closer inspection of the *leakage trace* and *leakage graph*, we see that the shift rows operation, in combination with glitches, causes a forwarding of the random bits used to re-share the thirteenth S-Box, making them observable at inv\_sr\_out2[1]. Since the same random bits are used to re-share the first S-Box, which eventually leads to comp\_sh2[1], the random bits cancel out at the multiplexer. Ultimately, the reuse of random bits causes a leak in the presence of glitches. We fix this by increasing the size of the random input i\_r from 48 to 192 bits, and avoiding the reuse of random inputs for the re-sharing of S-Box outputs. The second and third row of Table IV show the verification results for the fixed version of PRINCE-TI, where we were able to verify up to two rounds of the cipher in under four minutes.

#### *D. Verifying AES-DOM*

Rijndael, better known as the *Advanced Encryption Standard* (AES), is an extremely popular, secure, and widely adopted block cipher [8]. The 128-bit version of AES takes as input a 128-bit plaintext and encrypts it through ten rounds using a 128-bit key. First, the cipher adds the initial secret key to the plaintext to create the cipher's state and then expands the key into ten individual round keys. The first nine rounds apply the S-Box to each state byte, re-order the bytes, apply a linear transformation to 32-bit chunks, and mix the state with the round key. The last round does not apply the linear transformation as it does not contribute to security.

AES is not intended for masked implementations because it has a highly non-linear S-Box that is applied sixteen times per round. In order to minimize the used design area, masked AES implementations opt for only one S-Box module that is sequentially fed new bytes each clock cycle [25], [16].

We have analyzed the probing security of the DOM-protected [16] implementation of AES by Gross et al. in all three security models. The open-source implementation of AES-DOM<sup>4</sup> is written in VHDL and not in Verilog, so it is not directly compatible with our verification flow. However, due to the modularity of COCOALMA, we can produce a netlist with another synthesis flow, e.g., GHDL<sup>5</sup>, and extend it with a compatibility wrapper in Verilog so we can use Verilator for the *tracing* step of the original verification flow depicted in Figure 1. Although this is convenient, it is not strictly required, and COCOALMA also supports execution traces produced by other simulators in VCD format.

Executing the first round of the cipher requires one cycle of setup and twenty computation cycles. Notably, because of the parallelism in hardware designs, AES-DOM computes the linear operations of the first round just-in-time for their use as S-Box inputs in the second round. Therefore, the first 21 cycles only include the key addition, sixteen S-Box applications, and the byte re-ordering. The implementation processes 256 secrets, that is, 128 key bits and 128 plaintext bits. In each clock cycle, the AES-DOM consumes 46 uniformly random bits, yielding a total of 966 random bits for the first round of the cipher. The last column of Table IV shows the verification results for the first round of AES-DOM. The verification was successful in all three probing models, and since the AES-DOM implementation is more complex than PRINCE-TI, it naturally takes longer to verify. COCOALMA only takes about three hours to verify that the implementation of AES-DOM is secure in the hardware probing model.

### VI. RELATED WORK

The formal verifcation of power analysis countermeasures is a well-established research feld [1], [2], [4], [13], [10], [11], [19]. The community has been investigating two fundamentally different principles. On the one hand, there are approximative methods like those used in REBECCA [4], maskVerif [2], and COCOALMA. In contrast to REBECCA and COCOALMA, maskVerif opts for a language-based verifcation approach, tracks the symbolic representation of probing locations, and simulates the observations an attacker can make using uniformly random values. On the other hand, model counting methods inspect the truth table of a given function and check whether the correlation strength is zero for all secret values. Tools such as QMVerif [10] and QMSInfer [11] apply these methods to overcome the shortcomings of heuristics used in faster approximative methods. Similarly, probability-distribution tracking approaches such as SILVER [19] (implicitly) rely on model counting to determine the distribution type for any possible observation an attacker can make.

To our knowledge, maskVerif and SILVER were not used for stateful hardware verification. The authors of QMVerif and QMSInfer claim they support stateful hardware verification, but the tools are not open-source, so we could not replicate their results.

### VII. FUTURE WORK

The current version of COCOALMA is a significant improvement over its predecessor REBECCA [4]. However, there are still open questions that could yield performance or usability improvements.

The model of glitches used in COCOALMA seems too conservative, but we have no empirical evidence to the contrary. In particular, we assume that glitches are unpredictable and can forward any combination of the new and old signal values, even constants. This assumption might be too strict, and some combinations would not be observable in a power trace. Similarly, we assume the worst-case interaction between transition and glitch leakage, which might also be unnecessarily cautious. Eliminating these overly cautious assumptions would, by itself, reduce the verification complexity. Another avenue for increasing the scalability would be to consider implementation modules separately and tie the individual proofs together using composability notions [2].

### VIII. CONCLUSION

Although COCOALMA was originally designed for verifying software in the *time-constrained probing model*, it can also verify stateful hardware circuits in the *hardware probing model*. COCOALMA improves upon REBECCA in terms of scope and verification capabilities. It supports more security models, includes an elegant correlation-set encoding, supports circuit simulation, and uses it throughout the verification. The native support for stateful verification allows a tighter integration into the design flow, and as demonstrated with PRINCE-TI and AES-DOM, COCOALMA can be applied to industry-scale designs. We have successfully identified a leakage location in PRINCE-TI, which cannot be found by only analyzing the PRINCE-TI S-Box, as it requires the full context of the cipher's implementation. Through the debugging support provided by COCOALMA, we found the cause of the information leakage and fixed it by adding more random inputs. Furthermore, we have also demonstrated the modularity and adaptability of COCOALMA by verifying an AES-DOM design that uses an entirely different synthesis flow in another HDL.

Overall, we think COCOALMA is an excellent addition to any synthesis flow and can be used for the early detection of mistakes.

<sup>4</sup>https://github.com/hgrosz/aes-dom

<sup>5</sup>https://github.com/ghdl/ghdl-yosys-plugin

#### REFERENCES


# End-to-End Formal Verification of a RISC-V Processor Extended with Capability Pointers

Dapeng Gao University of Oxford

Tom Melham University of Oxford

*Abstract*—Capability Hardware Enhanced RISC Instructions (CHERI) extend conventional ISAs with *capabilities* that can enable fine-grained memory protection and scalable software compartmentalisation. CHERI-RISC-V is an extended version of the RISC-V ISA with support for CHERI, and Flute is an open-source 64-bit RISC-V processor with a five-stage, in-order pipeline. This case study presents the formal verification of CHERI-Flute, a modified version of Flute that implements CHERI-RISC-V, against the Sail CHERI-RISC-V specification. To the best of our knowledge, this is the first extensive formal verification of a CHERI-enabled processor.

We first translated relevant portions of the Sail CHERI-RISC-V specification to SystemVerilog Assertions. Then we formulated and proved four classes of end-to-end correctness properties about CHERI-Flute, covering the CHERI instructions and certain liveness properties about the entire processor. None of these results are routine—they all rely on novel proof engineering methodologies that extract microarchitectural invariants to serve as lemmas for the end-to-end proofs.

This work exposed several previously-unknown bugs in CHERI-Flute, most of which occur in the implementation of sophisticated combinational logic for certain CHERI instructions.

#### I. INTRODUCTION

Despite decades of hardening and mitigation efforts (such as stack protection, garbage collection, and virtualisation), memory safety issues remain a common and dangerous source of security vulnerabilities. A 2019 report by Microsoft [1] states that '70% of the vulnerabilities addressed through a security update each year continue to be memory safety issues'. The root cause of this phenomenon is the pervasive use of an unsafe memory model for interpreting the C programming language [2]. This model can be traced back to the PDP-11 and presumes that memory is simply a linear array of individually addressable bytes. This has induced a number of deeply ingrained assumptions about pointer behaviour that go beyond what is guaranteed by the C specification and rely only on 'implementation-defined behaviour'.

The Capability Hardware Enhanced RISC Instructions (CHERI) project offers an alternative model that provides better memory safety [3]. Its main features include a new machine representation of C pointers called *capabilities*, and extensions to existing instruction set architectures (ISA) that enable the secure manipulation of capabilities. For intuitive understanding, capabilities can be regarded as traditional pointers with extra properties that make them more like object references in a memory-managed language, such as Java. On one hand, this model continues to support limited arithmetic operations on capabilities that, for example, allow a loop to iterate through an array by repeatedly incrementing a capability. On the other hand, it makes it impossible to construct arbitrary capabilities that can be dereferenced—a significant departure from the usual 'unsafe' understanding of the C programming language.

Well-developed ISAs that integrate capabilities include CHERI-RISC-V and CHERI-MIPS [4], which are extended from RISC-V and MIPS. Rigorous engineering techniques have been used extensively in their development [5]. Specifically, Sail [6] specifications of these CHERI ISAs exist that give a precise and executable definition to each instruction.

This case study explores the formal verification of an open-source implementation of CHERI-RISC-V. Flute is a 64-bit RISC-V processor with a five-stage, in-order pipeline [7] released by Bluespec Inc. in late 2018. Researchers at Cambridge University have extended Flute with support for CHERI-RISC-V [8], and this extended implementation, named CHERI-Flute, was our verification target.

#### *A. Contributions*

We have verified several classes of properties for CHERI-Flute using the JasperGold formal verification environment [9]. The scope of our verification comprises the correct execution of all 80-plus CHERI instructions as well as certain liveness properties for the processor as a whole. Our proof does not cover the existing RISC-V instructions, which do not involve capabilities. Formal verification methodologies for these instructions are well-established and so they are not of central interest in this case study.

To the best of our knowledge, this is the first extensive formal verification of a CHERI processor implementation. Our aim in this paper is to make the methodology accessible for future verification projects on novel architectures, including ones that target capability hardware. All our verification code is available open-source [10].

We have deliberately taken an end-to-end approach. That is, properties are proved for the entire core, as opposed to individual components such as the individual execution units. In CHERI-Flute, the hardware that deals with capabilities is novel, complex, and distributed across the pipeline stages. Our end-to-end approach avoids the necessity to isolate this hardware and characterise its environment.

Our verification results all rely on novel proof engineering methodologies that extract microarchitectural invariants to serve as lemmas for the end-to-end proofs. Some of these invariants are of interest in themselves. For example, one of them shows that the core can never create a malformed capability, which is an important consistency invariant.

Fig. 1. A typical pointer represented by a capability

This case study exposed several previously-unknown bugs in the implementation of CHERI-Flute, which have all been reported to and confirmed by the designers [11], [12], [13]. Most of these bugs occur in the implementation of sophisticated bit manipulation logic for CHERI-related instructions, demonstrating the effectiveness of formal verification in catching subtle bugs in a novel processor design. In some cases, we have been able to provide verified bugfixes to the designers.

#### II. BACKGROUND TO CAPABILITY ARCHITECTURE

CHERI extends ISAs with a new hardware representation for pointers and new instructions for manipulating them. See [4] for its full specification and [14] for a high-level summary of the large research effort surrounding CHERI.

Instead of using 32- or 64-bit integers to represent pointers, CHERI uses a richer representation called *capabilities* that can be stored in *capability registers* in the core or in *capability-sized* and *capability-aligned words* in the memory. The program counter, which usually holds integer addresses, is replaced by the *program counter capability* (pcc).

A capability, illustrated in Fig. 1, contains additional information compared to a traditional pointer, most notably including the following.


CHERI instructions operate on capabilities in accordance with security principles such as *privilege minimisation*, *monotonicity*, and *provenance*; these are enforced by checking the Validity Tag, Permissions, Bounds, and other information attached to capabilities [4]. For example, only a valid capability, with permission to load, and whose address is within its bounds, can be used to load from that memory address. Otherwise, the processor traps and potentially causes the program to crash. The checks performed by each CHERI instruction are known as its *guard conditions*, and the correctness of their hardware implementation is crucial to the security protections provided by CHERI.

Fig. 2. Pipeline of Flute, including forwarding paths
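The load example can be made concrete with a small sketch. The `Capability` fields, the order of the checks, and all names below are simplifying assumptions for illustration; they do not reproduce the CHERI encoding or the exception priorities defined in the specification [4].

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    tag: bool        # validity tag
    can_load: bool   # stand-in for the load permission
    base: int        # inclusive lower bound
    top: int         # exclusive upper bound
    address: int

class CapabilityTrap(Exception):
    """Raised when a guard condition is violated."""

def load_guard(cap, size):
    """Return the name of the first violated guard condition, or None."""
    if not cap.tag:
        return "tag"
    if not cap.can_load:
        return "permission"
    if cap.address < cap.base or cap.address + size > cap.top:
        return "bounds"
    return None

def load(memory, cap, size=1):
    """Load `size` bytes through `cap`, trapping if any guard condition fails."""
    violation = load_guard(cap, size)
    if violation is not None:
        raise CapabilityTrap(violation)
    return memory[cap.address:cap.address + size]

memory = bytes(range(16))
in_bounds = Capability(tag=True, can_load=True, base=4, top=8, address=5)
straddling = Capability(tag=True, can_load=True, base=4, top=8, address=7)
print(load(memory, in_bounds, 2))   # b'\x05\x06'
print(load_guard(straddling, 2))    # bounds
```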

#### III. BASICS OF CHERI-RISC-V

CHERI-RISC-V extends the RISC-V ISA with support for CHERI [4]. This case study treats its 64-bit variant.

#### *A. Compression of Capabilities*

When stored in memory, capabilities are represented in a compressed format [4], [15]. A compressed capability in 64-bit CHERI-RISC-V takes 128 bits (plus an out-of-band validity tag bit)—twice as many bits as a traditional pointer. In the capability registers of the core, however, they are represented in a decompressed format that occupies even more bits. Decompression and compression are done transparently when they are moved between memory and the core.

Capability compression is lossy. That is, there exist decompressed capabilities that do not correspond to any compressed capability. These decompressed capabilities are termed *unrepresentable*. Such a capability poses a significant problem if it appears in the core, since there is no well-defined way to store it to the memory—as that would require compressing the capability first. Part of our verification is to show that unrepresentable capabilities can never be created by the processor.
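The notion of an unrepresentable capability can be illustrated with a toy encoding. The sketch below assumes a made-up format that keeps bounds only at 8-byte granularity; the real compressed format [4], [15] is considerably more intricate, and every name here is hypothetical.

```python
ALIGN = 8  # toy granularity, NOT the real CHERI-RISC-V encoding

def compress(base, top):
    """Toy lossy encoding: round base down and top up to a multiple of ALIGN."""
    return (base // ALIGN, -(-top // ALIGN))

def decompress(encoded):
    """Expand the toy encoding back into byte-granular bounds."""
    low, high = encoded
    return (low * ALIGN, high * ALIGN)

def is_representable(base, top):
    """A decompressed capability is representable iff it survives a round trip."""
    return decompress(compress(base, top)) == (base, top)

print(is_representable(16, 32))  # True
print(is_representable(16, 30))  # False: compresses to the bounds (16, 32)
```

In this toy model, a core holding the bounds (16, 30) would have no faithful way to store them, which is the situation the verification rules out.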

#### *B. Sail CHERI-RISC-V Instruction Specification*

The definition of each CHERI instruction in the Sail CHERI-RISC-V specification [16] roughly takes the form of Algorithm 1. An instruction can retire either *unsuccessfully*, due to violations of one of its guard conditions, or *successfully*, after modifying the architectural state of the processor. As will be seen in Section V-A, the distinction between successful and unsuccessful retirement is central to the way we specify instruction correctness in this work.
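The two ways of retiring can be sketched in a few lines. This is an illustrative Python rendering of the general shape described above, not the Sail text; the dictionary-based state, the guard names, and the example instruction are all assumptions.

```python
class Trap(Exception):
    """Raised when an instruction retires unsuccessfully."""

def execute(state, guards, update):
    """Check every guard condition, then apply the architectural update.

    `state` is a dict standing in for the architectural state; it is not
    modified, and a fresh dict is returned on successful retirement.
    """
    for name, holds in guards:
        if not holds(state):
            raise Trap(name)       # unsuccessful retirement
    return update(dict(state))     # successful retirement

# Hypothetical instruction: advance a capability's address by four bytes.
guards = [
    ("tag", lambda st: st["tag"]),
    ("bounds", lambda st: st["addr"] + 4 <= st["top"]),
]

def update(st):
    st["addr"] += 4
    return st

before = {"tag": True, "addr": 0, "top": 8}
after = execute(before, guards, update)
print(after["addr"])  # 4
```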

#### IV. FLUTE AND CHERI-FLUTE

Flute [7] is a 64-bit RISC-V processor with a five-stage, in-order pipeline designed for low- to medium-end applications. The processor is designed in Bluespec SystemVerilog (BSV) and has been synthesised and tested on Xilinx FPGAs.

Flute has the basic pipelined microarchitecture commonly found in computer architecture textbooks [17], featuring a Fetch (F), a Decode (D), an Execute (E), a Memory (M), and a Write-back (W) stage. It also comes with forwarding mechanisms to make the pipeline more efficient. The register file (regfile) consists of 32 general-purpose registers r<sub>0</sub>, ..., r<sub>31</sub>, where r<sub>0</sub> is hardwired to zero.

#### Algorithm 1: Typical CHERI instruction specification

Fig. 2 illustrates the pipeline of Flute with its stages occupied by instructions I<sub>1</sub>, ..., I<sub>5</sub>. Outgoing paths from stages M and W, including forwarding paths, are highlighted in red and blue respectively. These paths carry information about pending updates to the register file: the pending update in stage W writes the value v<sub>W</sub> into register rd<sub>W</sub>, and the pending update in stage M writes the value v<sub>M</sub> into register rd<sub>M</sub>.

To articulate properties, we define two subscripted register files: regfile<sub>M</sub>, which contains the contents of regfile after committing the pending update in stage W, and regfile<sub>E</sub>, which contains the contents of regfile after committing the pending updates in both stages W and M, in that order. The subscripted versions are essentially what the register file appears to be to stages M and E after forwarded values are taken into account, hence their subscripts.
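The two subscripted register files can be computed from the architectural register file and the pending updates, as in the following sketch. The representation of a pending update as an `(rd, value)` pair and the function names are assumptions made for illustration.

```python
def apply_update(regfile, rd, value):
    """Return a copy of regfile with register rd set to value (r0 stays zero)."""
    updated = list(regfile)
    if rd != 0:
        updated[rd] = value
    return updated

def forwarded_views(regfile, pending_w, pending_m):
    """Compute (regfile_M, regfile_E) from pending updates.

    Each pending update is an (rd, value) pair, or None if the stage holds
    no update. Stage W commits first, then stage M.
    """
    regfile_m = apply_update(regfile, *pending_w) if pending_w else list(regfile)
    regfile_e = apply_update(regfile_m, *pending_m) if pending_m else list(regfile_m)
    return regfile_m, regfile_e

regfile = [0] * 32
regfile_m, regfile_e = forwarded_views(regfile, pending_w=(5, 10), pending_m=(5, 20))
print(regfile[5], regfile_m[5], regfile_e[5])  # 0 10 20
```

When both stages target the same register, the younger instruction in stage M wins in regfile<sub>E</sub>, which is why the order of committing matters.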

#### *A. CHERI-Flute*

CHERI-Flute [18] extends Flute with support for CHERI-RISC-V. We sketch here the main relevant changes.

First, the registers are widened to become hybrid registers that can be used as both integer and capability registers. Second, most of the computation supporting the CHERI instructions—calculating bounds, incrementing addresses, and so on—is implemented within the ALU located in stage E. Finally, circuitry is added to stage M that partially checks whether any CHERI instruction passing through it violates the instruction's guard conditions. The rest of the checks are performed earlier by the ALU. While these checks could in principle all be placed in the ALU, this would cause unacceptably long delays in stage E for certain instructions. Hence they are spread across stages E and M instead.

#### V. FORMULATING CORRECTNESS

Our formal verification flow is driven by JasperGold. The design is first compiled into SystemVerilog using the open-source bsc compiler and then imported into JasperGold. This pre-compilation is necessary because JasperGold cannot read the Bluespec SystemVerilog source of CHERI-Flute directly.

The specification for correctness, which in our case is the Sail CHERI-RISC-V specification, also needs to be mapped into properties—written as SystemVerilog Assertions (SVA) about the compiled SystemVerilog design. Tooling does not exist to achieve this automatically, so for this case study we manually translated those portions of the Sail specification necessary for the verification effort into SVA. This yielded more than 1000 lines of data structures and functions of SystemVerilog and almost 100 correctness properties in SVA. As these properties are about a compiled design, a certain amount of 'reverse engineering' was needed to identify the relevant signal names.

#### *A. The Instruction Specification Framework*

A RISC-V processor is simple enough to formulate correctness of its instructions in the classical, direct way that will be familiar from many examples in the literature.

Let α be an abstraction function that maps each microarchitectural state of CHERI-Flute to a CHERI-RISC-V architectural state. Write $s \stackrel{I}{\longrightarrow} s'$ to mean that a CHERI-Flute processor retires instruction I and thereby transitions from microarchitectural state $s$ to microarchitectural state $s'$. Similarly, write $S \stackrel{I}{\longrightarrow} S'$ to mean that, according to the CHERI-RISC-V specification, executing instruction I alters the architectural state $S$ to architectural state $S'$. Note that both transition relations are deterministic.

Now for the implementation of an instruction I to conform to specification, we require that

$$\forall s \; s'. \; s \stackrel{I}{\longrightarrow} s' \implies \alpha(s) \stackrel{I}{\longrightarrow} \alpha(s') \tag{1}$$

where s ranges over the reachable microarchitectural states of CHERI-Flute. The reachability of s is, of course, crucial; this is further discussed in Section VI-B.

Now the formulation Prop. (1) faces a significant practical challenge. A CHERI instruction can be retired either successfully or unsuccessfully—and, in the latter case, there are sometimes more than a dozen ways in which it can fail. So formulating correctness as in Prop. (1) will require a full specification of what the processor's behaviour, and the resulting architectural state, should be for each kind of failure. This would be ideal, but also greatly increases the effort of formulating the required properties.

We therefore formulate a weaker notion of correctness that greatly simplifies the properties, albeit at the cost of a less comprehensive verification. Define two checkmarked relations as follows. For any instruction I and microarchitectural states $s$ and $s'$, the relation $s \stackrel{I\checkmark}{\longrightarrow} s'$ holds iff $s \stackrel{I}{\longrightarrow} s'$ and instruction I is retired successfully. And for any instruction I and architectural states $S$ and $S'$, the relation $S \stackrel{I\checkmark}{\longrightarrow} S'$ holds iff $S \stackrel{I}{\longrightarrow} S'$ and all of instruction I's guard conditions are met. Now, consider the property expressed by the proposition

$$\forall s \; s'. \; s \stackrel{I\checkmark}{\longrightarrow} s' \implies \alpha(s) \stackrel{I\checkmark}{\longrightarrow} \alpha(s') \tag{2}$$

which says that any *successful* retirement of instruction I occurs in compliance with the specification. Proving the stronger condition Prop. (1) shows the processor complies with the full specification indicated by Algorithm 1, which has numerous branches leading to different types of failures. Prop. (2) is a weaker condition but greatly simplifies the properties.

Fig. 3. Microarchitectural state with register-only instruction

This simplified property cannot detect a faulty processor with incorrect *unsuccessful* retirement, that is, a processor that correctly prevents a CHERI instruction that violates its guard conditions from being retired at the end of the pipeline, but which nonetheless produces an incorrect processor state according to the CHERI-RISC-V specification. The property will, however, still detect processors with incorrect *successful* retirement, that is, processors that produce the wrong architectural state upon a CHERI instruction being retired at the end of the pipeline, or processors that retire a CHERI instruction at the end of the pipeline that violates its guard conditions. This ensures that none of the security guarantees offered by CHERI is compromised. To see this, suppose for contradiction that Prop. (2) is true for some faulty processor which *incorrectly retires successfully* some instruction I, i.e., there exist $s$ and $s'$ such that the relation $s \stackrel{I\checkmark}{\longrightarrow} s'$ holds but some of instruction I's guard conditions are not met. Consequently, by Prop. (2), the relation $\alpha(s) \stackrel{I\checkmark}{\longrightarrow} \alpha(s')$ also holds. But this implies that all of instruction I's guard conditions *are* met, which contradicts the assumption. Section IX discusses ways to relatively easily obtain properties that reflect the stronger specification.
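What Prop. (2) does and does not detect can be illustrated on a toy model. The sketch below checks the property over an explicit, finite list of implementation transitions; the counter-style instruction, the state shapes, and all names are hypothetical, and explicit enumeration merely stands in for a model checker.

```python
def satisfies_prop2(impl_transitions, spec_step, alpha):
    """Check Prop. (2) on an explicit list of implementation transitions.

    impl_transitions: iterable of (s, s_next, retired_successfully)
    spec_step(S): returns (S_next, guards_met) for the instruction under test
    alpha: abstraction from microarchitectural to architectural states
    """
    for s, s_next, success in impl_transitions:
        if not success:
            continue  # Prop. (2) is silent about unsuccessful retirement
        spec_next, guards_met = spec_step(alpha(s))
        if not guards_met or spec_next != alpha(s_next):
            return False
    return True

# Toy instruction: increment a counter, guarded by counter < 3.
def spec_step(counter):
    return (counter + 1, True) if counter < 3 else (counter, False)

def alpha(state):
    return state[0]  # drop the microarchitectural scratch component

correct = [((0, "a"), (1, "b"), True), ((3, "a"), (3, "c"), False)]
bad_success = [((3, "a"), (4, "b"), True)]    # retires despite a failed guard
bad_failure = [((3, "a"), (0, "c"), False)]   # corrupts state while trapping

print(satisfies_prop2(correct, spec_step, alpha))      # True
print(satisfies_prop2(bad_success, spec_step, alpha))  # False
print(satisfies_prop2(bad_failure, spec_step, alpha))  # True (not detected)
```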

#### *B. Expressing Specifications as Properties*

For mechanised formal verification in JasperGold, it is of course necessary to articulate the intent of the abstract correctness condition described by Prop. (2) as a group of SystemVerilog expressions. In practice, this means


Note that expressing (i) means characterising when the instruction I *has retired successfully*. One of the contributions of our methodology is to observe that this can be tied to the detection of certain microarchitectural states. Note also that (ii) is much simpler than having also to define the architectural states resulting from every kind of unsuccessful retirement.

In practice, we have developed these properties in separate groups for each of three distinct classes of instructions that share common structure. The sections that follow explain these. In the actual proof code, a systematic scheme of 'property templates' is employed to make it easy to create and manage almost 100 properties without having to maintain multiple copies of boilerplate code. It also allowed us to quickly implement and validate proof engineering ideas for a large batch of properties, improving research efficiency.

Fig. 4. Microarchitectural state with state abstractions

#### *C. Register-Only CHERI Instructions*

A register-only CHERI instruction computes a function of its operands and writes a result into a given register, causing a trap if any of its guard conditions is not met.

Recall from Section V-B that two expressions are needed to formulate the required correctness properties. To express (i), consider Fig. 3, which shows the microarchitectural state when some register-only instruction I<sub>1</sub> is in stage W. Denote this state by $s$ and the state right after instruction I<sub>1</sub> is retired by $s'$. Since stage W is at the end of the pipeline, any instruction reaching stage W is retired at the end of the current cycle. Moreover, any instruction reaching stage W can no longer cause traps, so it is bound to be retired successfully. Conversely, if a register-only instruction is retired successfully, then it must have been in stage W just before its retirement. So $s \stackrel{I_1\checkmark}{\longrightarrow} s'$ and (i) can be expressed simply by checking whether the given instruction is in stage W.

To express (ii), consider Fig. 4, which illustrates the microarchitectural state of CHERI-Flute in some state $s$ that is about to successfully retire instruction I<sub>1</sub> and enter state $s'$, i.e., $s \stackrel{I_1\checkmark}{\longrightarrow} s'$. Hence $\alpha(s)$ and $\alpha(s')$ must give the architectural states right before and after instruction I<sub>1</sub> is retired. Then observe that


so (ii) can be expressed as a function of state s.

Given formulations of expressions (i) and (ii), the SVA property for a register-only instruction with register addresses rd and rs, and immediate data imm will say that if stage W contains an instruction with opcode OP, then


where result<sub>OP</sub> and guard<sub>OP</sub> are SystemVerilog functions translated from the Sail specification of the instruction with opcode OP that compute its write-back result and guard conditions respectively.
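The general shape of such a property can be sketched as an executable check. The sketch assumes the property requires the pending write-back value in stage W to equal the specified result and the guard conditions to hold on the source operand, mirroring the stage-M lemma of Section VI-A; the dictionary layout of `stage_w`, the opcode name, and both example functions are hypothetical and are not taken from the proof code.

```python
def register_only_property(regfile, stage_w, op, result_op, guard_op):
    """If stage W holds an instruction with opcode `op`, its pending
    write-back value must match the specification and its guards must hold."""
    if stage_w is None or stage_w["opcode"] != op:
        return True  # antecedent is false, so the implication holds
    operand = regfile[stage_w["rs"]]
    imm = stage_w["imm"]
    return stage_w["value"] == result_op(operand, imm) and guard_op(operand, imm)

# Hypothetical instruction "INC": result = operand + imm, guarded against overflow.
def result_inc(operand, imm):
    return operand + imm

def guard_inc(operand, imm):
    return operand + imm < 2 ** 64

regfile = [0] * 32
regfile[3] = 10
good = {"opcode": "INC", "rs": 3, "imm": 5, "value": 15}
bad = {"opcode": "INC", "rs": 3, "imm": 5, "value": 16}

print(register_only_property(regfile, good, "INC", result_inc, guard_inc))  # True
print(register_only_property(regfile, bad, "INC", result_inc, guard_inc))   # False
```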

#### *D. Branching CHERI Instructions*

A branching CHERI instruction redirects the control flow and (optionally) saves the return address in a given register. Of course, it also has guard conditions to ensure that the updated pcc has the right Bounds and Permissions. This creates an opportunity to decompose what a branching instruction does into two operations: checking its guard conditions and (optionally) saving the return address, and (conditionally or unconditionally) redirecting the control flow.

The first of these is just what a register-only instruction does, so we can simply reuse the property template developed in Section V-C. So the rest of this section is devoted to formulating the correctness properties about the second operation.

First, it is necessary to briefly explain how the control flow is managed in CHERI-Flute. Initially, stage F fetches an instruction from fetch\_addr and predicts the address of the next instruction using the branch predictor. This predicted address (pred\_addr) is *by default* used as the next fetch\_addr, and it is also passed along the pipeline with the *currently* fetched instruction until it reaches stage E, where the ALU computes the correct address of the next instruction (next\_addr). The processor then compares the computed next\_addr with the pred\_addr it received. If the two addresses do not match, then a branch misprediction has occurred, and stage F has been fetching the wrong instructions and passing them along the pipeline. To rectify this, fetch\_addr is set to next\_addr, and all pipeline stages prior to stage E are flushed. Otherwise, if the branch prediction has been correct, no flushing is needed and fetch\_addr is updated in the default way.
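The redirection decision taken once stage E has computed the correct address can be summarised in a short sketch. The function name and the `default_fetch_addr` parameter (the address stage F would use next if nothing intervened) are assumptions for illustration.

```python
def resolve_branch(pred_addr, next_addr, default_fetch_addr):
    """Return (fetch_addr, flush) once the ALU has computed next_addr.

    pred_addr: the prediction that travelled down the pipeline with the instruction
    next_addr: the correct next address computed in stage E
    default_fetch_addr: where stage F would fetch next by default
    """
    if pred_addr != next_addr:
        # Misprediction: redirect the fetch and flush the stages before E.
        return next_addr, True
    # Correct prediction: fetch_addr is updated in the default way.
    return default_fetch_addr, False

print(resolve_branch(0x104, 0x200, 0x108))  # (512, True)
print(resolve_branch(0x104, 0x104, 0x108))  # (264, False)
```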

Fig. 5 shows the microarchitectural state when some branching instruction I<sub>3</sub> is in stage E. To formulate the correctness properties about control flow redirection, the framework developed in Section V-A is slightly generalised. Specifically, if a branching instruction I is in stage E and a branch misprediction has occurred, then instruction I is now considered 'about to be retired successfully' insofar as control flow redirection is concerned, and it is now considered to have been 'retired successfully' after fetch\_addr is set to next\_addr. This gives the expression (i) discussed in Section V-B. As for expression (ii), the architectural states of the processor right before and after some branching instruction is retired successfully are taken from the values of fetch\_addr before and after that instruction is retired successfully, respectively.

#### *E. Memory CHERI Instructions*

A memory CHERI instruction loads from or stores to the memory using the capability (directly or indirectly) specified by its operands, causing a trap if any of its guard conditions is not met. What a memory instruction does can be decomposed into two operations: checking its guard conditions, and loading from or storing to the memory.

The correctness properties about the first operation can be formulated simply by reusing the property template developed in Section V-C. Hence this section focuses on formulating the correctness properties about the second operation.

CHERI-Flute is connected to the memory hierarchy through an interface consisting of several input and output ports, which must be properly used in order for the memory to function correctly. As with register-only instructions, a memory instruction I is about to be retired successfully when it is in stage W, after having sent and fulfilled its request to the memory in stage M. Thus, the correctness property should assert that before I is retired successfully, when it was in stage M, the memory interface had been properly used to fulfil what the specification requires of it. In our proof, SVA sequences are used to precisely specify the exact sequence of events that must have taken place when instruction I was in stage M.

Fig. 6 and Fig. 7 show how a memory *load* instruction I<sub>2</sub> is moved from stage M to stage W and becomes ready to be retired successfully. The correctness property checks that


The correctness properties about memory *store* instructions are highly similar and thus omitted here.

#### *F. Processor Liveness*

All correctness properties discussed so far are safety properties. Our verification also tackled the important issue of processor liveness—demonstrating that the processor does not freeze so that the pipeline never progresses.

Of course, there are challenges when dealing with liveness. First, it is usually very difficult to prove liveness properties in practice, and there is no such thing as a bounded proof for liveness that can at least give some confidence. Second, even if a liveness property is proved, there is still no guarantee about *when* the desirable event will occur, which is not ideal when performance is critical. Third, a necessary condition for a processor to exhibit liveness is the correct behaviour of the external components connected to it. For example, if the memory never fulfils a load request, then the processor might wait indefinitely for a response, stalling the pipeline. This can be ruled out by assuming certain fairness constraints about the external components, but these can of course potentially be violated unless they are themselves verified.

There is a conventional workaround to the first two problems. Instead of proving the liveness property that 'the pipeline eventually progresses', we derive a *safety* property that 'the pipeline progresses within n cycles' parametrised by n and search for the smallest n (if it exists) for which the safety property can be proved. This not only averts the difficulty of proving liveness properties but also generates a concrete bound on when the pipeline progresses.

Fig. 5. Microarchitectural state with branching instruction

Fig. 6. Microarchitectural state with load instruction in stage M

Fig. 7. Microarchitectural state with load instruction in stage W

Fig. 8. Microarchitectural state with register-only instruction in stage M
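The search for the smallest bound can be sketched on a toy transition system. Explicit-state exploration stands in here for the model checker; the state names, the `max_n` cut-off, and the assumption that every state has at least one successor are all illustrative.

```python
def progresses_within(succ, progress, start_states, n):
    """Safety check: every path from a start state reaches `progress` in <= n steps.

    succ maps each state to its list of successors (assumed non-empty).
    """
    frontier = set(start_states)
    for _ in range(n):
        # Follow only the paths that have not yet made progress.
        frontier = {t for s in frontier for t in succ[s] if t not in progress}
        if not frontier:
            return True
    return False

def smallest_bound(succ, progress, start_states, max_n):
    """Return the smallest n <= max_n for which the safety property holds."""
    for n in range(1, max_n + 1):
        if progresses_within(succ, progress, start_states, n):
            return n
    return None  # no bound found: the property may be a genuine liveness failure

# Toy pipeline: a new instruction enters after at most two wait states.
live = {"enter": ["wait1"], "wait1": ["wait2", "enter"], "wait2": ["enter"]}
# Variant in which wait1 may stall forever.
stuck = {"enter": ["wait1"], "wait1": ["wait1", "enter"]}

print(smallest_bound(live, {"enter"}, {"enter"}, max_n=20))   # 3
print(smallest_bound(stuck, {"enter"}, {"enter"}, max_n=20))  # None
```

The second system shows why fairness constraints matter: without an assumption that rules out the self-loop, no finite bound exists.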

The derived safety property we proved for CHERI-Flute says that if an instruction enters stage E, then within nine cycles, either a new instruction enters stage E, or the processor enters one of three special states, triggered by particular instructions, that requires it to wait for certain external signals.

This property shows that as long as the processor does not enter one of the special states, new instructions will enter stage E periodically, so the pipeline never freezes. The number 'nine' is the smallest number for which this property can be proved, and the focus on stage E is because certain RISC-V instructions are retired in stage E—i.e. they are never moved into stages M or W. Asserting this property on any stage *prior to* stage E always attracts a counterexample where an instruction is repeatedly issued but never reaches beyond stage E, effectively stalling the subsequent stages.

Of course, the proof of this property relies on several fairness constraints. Most notably, it is assumed that the memory always fulfils a request within *two* cycles. The number 'two' here is arbitrarily chosen, and it is reasonable to conjecture that a different number can be used without making any substantial difference other than perhaps affecting the number 'nine' in the derived safety property.

#### VI. PROOF ENGINEERING

Not all our correctness properties can be proved in a push-button manner. Specifically, those properties about register-only CHERI instructions as well as those about the register-only components of branching and memory CHERI instructions *cannot* be proved straightforwardly. Instead, proof convergence on these properties relies on proof engineering methodologies that are explained in this section.

#### *A. Decomposing the Pipeline*

This methodology is called 'decomposing the pipeline' because it enables one to prove some property about a desired instruction when it is in a *later* stage of the pipeline by first proving some lemmas about the instruction when it was in *earlier* stages of the pipeline.

*1) The First Lemma:* The correctness property shown in Section V-C for any register-only instruction cannot be proved directly in JasperGold. Instead, we prove a structurally identical version of the property that is 'pushed back' one stage in the pipeline, referencing regfile<sub>M</sub> instead of regfile, rd<sub>M</sub> and v<sub>M</sub> instead of rd<sub>W</sub> and v<sub>W</sub>, and using a suitably adjusted guard<sup>M</sup><sub>OP</sub> function, as we sketch below.

If this version of the property can be proved, then it can be used as a lemma to successfully prove the original correctness property through k-induction [19]. The lemma is a property of a register-only instruction in stage M instead of stage W. Observe that the write-back result of any register-only instruction is computed by the ALU in stage E. Therefore, for any register-only instruction I<sub>1</sub> in stage M with opcode OP as illustrated in Fig. 8, its write-back result must already be available in v<sub>M</sub>. This means that we can assert

$$\mathbf{v}\_M = \operatorname{result}\_{\mathsf{OP}}\left(\mathbf{r} \oplus \operatorname{gf} \mathbf{i} \, 1 \oplus\_M \left[rs\right], \operatorname{imm}\right),$$

in the lemma, where the subscripted regfile<sup>M</sup> is used to take into account any forwarded value v<sup>W</sup> from stage W.

Now recall from Section IV-A that checks for guard conditions are spread across stages E and M. Thus, when instruction I<sup>1</sup> reaches stage M, only the checks in stage E have been performed, whereas the checks in stage M are still underway. Therefore, it is incorrect to assert that

$${"{g} and\_{\mathbf{OP}}}\left(\mathbf{re}\,\mathbf{g}\,\mathbf{f}\,\mathbf{i}\,1\,\mathbf{e}\_M\left[rs\right], imm\right)}$$

in the lemma. Rather, the lemma only asserts that the subset of instruction I1's guard conditions that are checked in stage E have been met. This subset is given by guard<sup>M</sup> OP.

Given the lemma, the original correctness property can be proved by k-induction. But without it, k-induction is unable to converge because for any value of k, the SAT-solver can always find a trace that violates the inductive hypothesis. Such a trace would begin at an *unreachable* microarchitectural state where the desired instruction is in stage M. It would then *stall* the pipeline during the next (k − 1) steps, only moving the desired instruction to stage W at the (k+1)-th step, where the inductive hypothesis fails to hold. The pipeline can stall for arbitrarily many cycles in such traces due to the absence of the very fairness constraints that enable the proof of the liveness properties in Section V-F. However, it is unnecessary to add fairness constraints here. Instead, we use the given lemma to prevent the SAT-solver from exploring such unreachable states. And since stage M is immediately prior to stage W, k = 1 is sufficient for the proof to converge.
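The convergence argument above can be reproduced on a toy model. The following is a minimal sketch, not the JasperGold setup used in this work: it assumes a hypothetical three-stage pipeline (E, M, W) carrying one instruction that may stall arbitrarily, a stand-in `result` function, and explicit enumeration of states in place of a SAT-solver. Only the inductive step is checked; the base case is omitted.

```python
from itertools import product

STAGES = ("E", "M", "W")          # toy three-stage pipeline (hypothetical)
OPERANDS = range(4)
VALUES = range(8)

def result(x):
    """Stand-in for result_OP; any function of the operand would do."""
    return x + 1

def successors(state):
    """The instruction may stall in place or advance one stage."""
    stage, x, v = state
    if stage == "E":
        return [state, ("M", x, result(x))]   # the ALU computes the result in E
    if stage == "M":
        return [state, ("W", x, v)]           # the value is carried unchanged to W
    return [state]

def prop(state):
    """Original property: in stage W the value is the specified result."""
    stage, x, v = state
    return stage != "W" or v == result(x)

def lemma(state):
    """Pushed-back lemma: the same claim one stage earlier, in stage M."""
    stage, x, v = state
    return stage != "M" or v == result(x)

ALL_STATES = list(product(STAGES, OPERANDS, VALUES))   # includes unreachable states

def paths(k):
    """Every k-step path of the transition relation, from any start state."""
    frontier = [[s] for s in ALL_STATES]
    for _ in range(k):
        frontier = [p + [t] for p in frontier for t in successors(p[-1])]
    return frontier

def step_holds(k, target, assumed=lambda s: True):
    """Inductive step of k-induction for `target`, optionally under a lemma."""
    for path in paths(k):
        prefix, last = path[:-1], path[-1]
        if all(target(s) and assumed(s) for s in prefix) and not target(last):
            return False
    return True

print([step_holds(k, prop) for k in range(1, 5)])   # without the lemma
print(step_holds(1, prop, assumed=lemma))           # with the lemma
print(step_holds(1, lemma))                         # the lemma itself
```

In this toy model the step fails for every k tried when the lemma is absent, because a path can start in an unreachable stage-M state holding a wrong value and stall there before advancing to W. With the lemma assumed, 1-induction succeeds, and the lemma is itself 1-inductive here because stage E of the toy model carries no state that needs a further lemma.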

*2) The Second Lemma:* To actually prove the lemma just explained, the same methodology is simply reapplied. That is, a *second* lemma is used to narrow the space of states in which the desired instruction is in stage E so as to exclude traces that violate the first lemma.

Fortunately, this second lemma is relatively easy to discover, since the only state information contained in stage E is the decoded content of the current instruction in stage E. Thus, the second lemma simply needs to assert that any instruction in stage E is properly decoded, which enables the proof of the first lemma by 1-induction.

Now this second lemma can, in turn, be proved by 1-induction if a similar *third* lemma is proved about stage D. And so on. This chain of lemmas stops, of course, at stage F where the last lemma can be proved directly. In practice, however, since CHERI-Flute's design of stages F and D is relatively simple, we took advantage of one of JasperGold's black-box proof engines to automatically complete the proof.

#### *B. Developing Microarchitectural Invariants*

CHERI instructions compute relatively sophisticated functions of their operands. In the Sail specification, these are given by total functions on all decompressed capabilities, including the unrepresentable ones mentioned in Section III-A. But since unrepresentable capabilities pose a significant problem if they appear in the processor, CHERI-Flute is designed so they can never be created by the hardware in the first place. CHERI-Flute is then excused from conformance with the specification for unrepresentable capabilities.

This, of course, leads to the generation of unreachable counterexamples in model checking, so our verification includes a global consistency invariant over the entire processor, showing that only representable capabilities are present. Formulating and proving this invariant was challenging because there are many internal registers in CHERI-Flute's microarchitecture that can influence the architecturally visible registers. A weak invariant that does not cover these internal registers cannot be proved by k-induction since the SAT-solver can always find an unreachable state in which one of these registers contains an unrepresentable capability, which then 'pollutes' one of the architecturally visible registers within the next few cycles.

This challenge was overcome using State-Space Tunnelling, a JasperGold feature that allows the user to prune unreachable portions of the state space when performing k-induction proofs. Essentially, it allows us to specify some k and let the SAT-solver generate a trace of length k that violates the invariant. The user then examines this trace to identify any internal register that causes the violation, and manually strengthens the invariant to include it.

This process repeats until, for some sufficiently large k, no violating trace can be found, at which point proof convergence for the invariant is achieved. In the end, the invariant in our proof was sufficiently strong to be proved by 1-induction.
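The strengthening loop can be sketched as follows. This is an illustrative toy, not State-Space Tunnelling itself: it assumes a hypothetical chain of three registers (`tmp` feeding `buf` feeding the architecturally visible `reg`), tracks only whether each register holds a representable capability, and replaces the manual inspection of a trace by a simple rule that adds every uncovered register found to be bad in the violating pre-state.

```python
from itertools import product

# Hypothetical register chain: tmp -> buf -> reg (only reg is architecturally visible).
REGS = ("tmp", "buf", "reg")

def step(s):
    """One cycle: the hardware only ever creates representable values in tmp."""
    return {"tmp": True, "buf": s["tmp"], "reg": s["buf"]}

def holds(inv, s):
    """The invariant says every register it covers is representable."""
    return all(s[r] for r in inv)

def find_cti(inv):
    """Return a state that satisfies inv but whose successor violates it, if any."""
    for bits in product((True, False), repeat=len(REGS)):
        s = dict(zip(REGS, bits))
        if holds(inv, s) and not holds(inv, step(s)):
            return s
    return None

def strengthen(inv):
    """Add offending internal registers until the invariant is 1-inductive."""
    inv = set(inv)
    while (cti := find_cti(inv)) is not None:
        # Stand-in for the manual step: blame uncovered registers that are bad.
        inv |= {r for r in REGS if not cti[r]} - inv
    return inv

print(sorted(strengthen({"reg"})))
```

Starting from the weak invariant over `reg` alone, the loop first picks up `buf` and then `tmp`, after which no violating state remains. The real design has far more internal registers, and choosing which ones to add is the part that required engineering judgement.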

#### VII. RESULTS AND EVALUATION

In this case study, the implementations of all 80-plus CHERI instructions (except a very few not yet implemented) have been subject to formal verification in JasperGold against the correctness properties in Section V through the proof engineering methodologies in Section VI. <sup>1</sup> While the implementations of most instructions were found to satisfy the correctness properties, several were found to be buggy.

The bugs found roughly fell into two categories. The first category are simple coding mistakes: the designer failed to notice details of the specification, or the specification changed after the design was created. These bugs are usually detectable with a moderate amount of scrutiny or simulation testing. The second category are algorithmic errors, typically caused by subtle mistakes in complex pieces of logic. These are much more difficult to uncover, even with the most intensive code review or simulation testing.


These two bugs have been confirmed and fixed by the designers [11], [13]. The following have also been confirmed by the designers and fixes are pending:


<sup>1</sup>On a 24-core AMD EPYC 7F72 processor, with 256 GB of RAM, the proofs are completed within two hours through parallelisation.

One final bug illustrates an especially productive collaboration between verification and design: in the setAddress function, the validity tag of the returned capability is cleared incorrectly in a corner case.

This function was originally developed by trial and error using the BlueCheck automated test generation framework [20] as well as TestRIG, a framework for testing RISC-V processors with random instruction generation [21]. But neither method detected this corner case. The designers' initial patch for the function was buggy because it mishandled another corner case, which was yet again detected by formal verification. Consequently, we redesigned the function from scratch and formally verified its correctness against the specification before it was submitted to and accepted by the designers [12].

#### *A. Bug or Feature?*

Two issues belong to an interesting category sometimes encountered in formal verification: a trace violates the specification, but it is unclear whether the hardware should be changed to match the specification or *vice versa*.

The first was that the specification requires the CSetOffset and CIncOffset instructions to perform a standard 'representability check' to determine if the capabilities they return are representable. But in CHERI-Flute the CSetOffset instruction performs a slightly different, non-standard check optimised for that particular instruction, although the CIncOffset instruction uses the standard check.

So the behaviour of the CSetOffset instruction violates the specification, but in a beneficial way. It is therefore up to the designers to decide whether the specification should be changed to incorporate this optimised representability check.

The second was that, when trying to prove the global consistency invariant, we found counterexample traces where memory corruption injects corrupted capabilities into the core. Since memory bit-flips do occur in actual hardware, we suggested that the core should perform sanity checks on any capability retrieved from the memory, clearing its validity tag if it is found to be corrupted.

In the end, the designers decided not to add the sanity checks because they may cause even more unexpected behaviour when memory corruption occurs, making the situation more complex to debug. So to make the proof of the global consistency invariant converge, we added an assumption that the memory never returns a corrupted capability.

#### VIII. RELATED WORK

The correctness of processor cores and their implementation of instructions has been a focus of verification research for decades, going at least back to the pioneering work of Hunt on verifying the FM8501 [22] and FM8502 processors [23]. To verify more complicated, pipelined designs, Burch and Dill devised the flushing abstraction [24], a member of an extensive family of formulations of correctness that has expanded to cover even out-of-order designs. Aagaard et al. [25] present a useful framework for classifying these different approaches.

From about the mid 1990s, verification was increasingly adopted in industry to verify critical components of large-scale designs. Notable experiments include Kaivola et al.'s verification of the Pentium 4 floating-point divider [26], Jacobi et al.'s fully automated verification of fused-multiply-add floating-point units [27], Kaivola's methodology for large-scale formal verification of control-intensive circuits [28], and Slobodova's verification of AES hardware support [29]. A landmark achievement in this direction was Kaivola et al.'s work on replacing testing with formal verification for validating the core execution cluster of the Core i7 design [30].

The starting point of our work was Reid et al.'s end-to-end verification of Arm processors [31]. But our approach to verifying properties differs significantly from this work. While the Arm verification uses bounded model checking, we obtained much stronger unbounded proofs of all correctness properties by extracting microarchitectural invariants. Of course, the relative simplicity of RISC-V helped make this possible, but it was also enabled by the complexity management methodologies we explain in this paper.

A landmark in the verification of complex cores is the work by Goel et al. [32] on verifying x86 instructions. This was done using the ACL2 theorem prover in concert with a number of tightly integrated support tools, and achieved an end-to-end verification that encompasses decoding, translation into microcode, traps to microcode ROM, and execution.

There has been related work on verifying processors using Symbolic Quick Error Detection (SQED) and its variants [33], [34], [35]. These methodologies use bounded model checking to find sequence-dependent bugs that violate a self-consistency property, but they are not intended for checking single-instruction bugs where an instruction always produces the wrong result for certain inputs [33]. In contrast, our methodology checks for both types of bugs. Indeed, most, if not all of the bugs we found were single-instruction bugs that could not be uncovered by checking for self-consistency. Instead, a more traditional approach using a formal specification was required.

#### IX. CONCLUSIONS AND PROSPECTS

There are several ways in which the present work can be improved and extended.

For this project, we manually translated the Sail specification of CHERI-RISC-V into SVA. It would obviously be preferable to have an automatic translation, and we are investigating some options for this. Apart from the usual benefits of automation, automatic translation could eliminate the pragmatic need to weaken the specification as described in Section V-A. As Sail has been adopted by the RISC-V Foundation for its golden formal model, a flow from Sail to SVA seems highly desirable in any case.

Further work can also be done to address the drawbacks of the liveness properties described in Section V-F. For example, it would be ideal to remove the proof's reliance on fairness constraints that contain arbitrarily chosen numbers. Also, the work can be made more complete by proving liveness properties about pipeline stages subsequent to stage E.

Attempts could be made to verify more complex CHERI-RISC-V processors, such as Toooba [36], where the main challenge will be to formulate correctness properties about an *out-of-order* microarchitecture. We note, however, that the SystemVerilog functions translated from the Sail specification during the present work can be completely reused when formulating the new correctness properties.

Finally, we mention that in 2019, the UK announced its *Digital Security by Design* programme with £190 million of funding for a set of research projects [37] to 'radically update the foundation of our insecure digital computing infrastructure, by demonstrating that mainstream processor technology . . . can be updated to include new security technologies based on the CHERI Architecture' [38]. A cornerstone of the programme is Morello [39], a CHERI-enabled prototype developed by Arm and scheduled for release in late 2021. We hope that this early RISC-V case study provides at least some insights that might eventually apply in the formal verification of Morello.

#### X. ACKNOWLEDGEMENTS

We are grateful to members of the CHERI group at Cambridge. Alasdair Armstrong, Alexandre Joannou, Simon Moore, Peter Rugg, Peter Sewell, Robert Watson, and Jonathan Woodruff all kindly provided assistance or comments on this work. Thanks also go to Ziyad Hanna at Cadence and to Joe Stoy at Bluespec, who thoughtfully answered our questions about Bluespec SystemVerilog. This work was funded in part by the UKRI programme on Digital Security by Design (Ref. EP/V000225/1, SCorCH [40]).

#### REFERENCES


# Hardware Security Leak Detection by Symbolic Simulation

Neta Bar Kama Core and Client Development Group Intel Corporation Haifa, Israel neta.bar.kama@intel.com

*Abstract*—Aiming to expose security risks in hardware designs, we describe a novel usage of symbolic simulation that led to discoveries of previously unknown potential local data leakages on an Intel Core processor design. Symbolic simulation is an established formal verification method, the main vehicle for verification of arithmetic data-paths in Intel Core processor designs for twenty years. It extends traditional simulation by allowing symbolic variables in the stimulus, covering the circuit behavior for all possible values simultaneously. A special trait of symbolic simulation is that every variable has a name. In the security context, named values allow us to know the exact origin of data and identify data leakages by determining whether values are expected to be read by an operation or present a risk. Leveraging the existing formal verification infrastructure and observing an operation's data dependencies we could identify local leaks without the need to have a complete functional specification for the operation.

*Index Terms*—Security, Data Leakage, Formal Verification, Symbolic Simulation

#### I. INTRODUCTION

Comprehensive formal verification of execution engines has been standard practice in virtually all Intel® Core™ processor development projects in the last two decades, and extensive infrastructure has been built to support these efforts. The technical basis of this work is symbolic simulation, a technology extending usual digital circuit simulation with symbolic values, representing sets of concrete values in a single simulation.

In the aftermath of the Spectre and Meltdown vulnerabilities, security has become a greater focus area for validation. In this paper we discuss a novel approach leveraging the existing formal infrastructure for Intel Core processor Execution clusters (EXE) to analyze potential data leakages, security violations where privileged data could be made visible to non-privileged parties. The approach is based on the special feature of symbolic simulation that stimulus values have names that can be used to uniquely relate a value to a specific signal and time.

Intel provides these materials as-is, with no express or implied warranties. Intel processors might contain design defects or errors known as errata, which might cause the product to deviate from published specifications. No product or component can be absolutely secure. Intel, Intel Core, Intel Atom, Pentium and Intel logo are trademarks of Intel Corporation. Other names and brands might be claimed as the property of others.

Roope Kaivola Core and Client Development Group Intel Corporation Portland, OR, USA roope.k.kaivola@intel.com

Below we first discuss the concept of symbolic simulation and its use in EXE formal verification, and the security challenges in EXE. Then, we will describe the principles of our solution analyzing potential data leakages using symbolic simulation, practical considerations in the implementation of the solution over a live Intel Core processor development project, and the results of our experiments. With a moderate engineering effort, we were able to extend the existing formal environment with extra checkers detecting potential data leakages. On the one hand, this allowed us to verify the absence of data leaks for large classes of micro-operations, and on the other to identify several previously undiscovered local data leakage issues, where micro-operations unintentionally wrote back data that had been left behind in the internal state of the cluster by a previous micro-operation.

The closest counterpart to our work in the scientific literature or commercial tools is taint analysis [1], [2], [3], [4]. Like our approach, taint analysis tracks the propagation of values from one signal to another. However, the approaches differ in two respects. First, taint analysis works by attaching extra information, the 'taint', to simulation values to track their progress, and requires extra engineering either in the simulator or in post-simulation analysis. In our approach values are tracked using the symbolic variable names already present in the symbolic simulation for the verification, and we only needed to implement a thin analysis layer on top of the existing collateral. Second, taint analysis generally assumes a static classification of signals into 'secret' and 'non-secret' and analyzes possible paths leaking secret values to non-secret signals. This does not adequately reflect the common design pattern of pipelined designs, like the EXE cluster, where the same signals are used to carry both secret and non-secret data at different times, and the notion of a 'secret' is relative to a micro-operation. To our knowledge, our work is among the first published explorations of the application of symbolic simulation to security verification of hardware designs (cf. [2], [5]).

#### II. SYMBOLIC SIMULATION IN EXE VERIFICATION

#### *A. Symbolic Circuit Simulation*

Digital circuit simulation is a standard tool in the arsenal of every working circuit design and validation engineer. Symbolic simulation extends this technology with the ability to carry out a simulation using symbolic representations of sets of values in a single simulation trace [6], [7].

Fig. 1. Symbolic expressions in simulation

Fig. 2. Logic with the undefined value X

Fig. 4. Symbolic trace

In a symbolic simulator the input stimulus may contain symbolic variables in addition to the traditional concrete values 0, 1, *X* or *Z*. These symbolic variables are effectively names of values, denoting sets of possible actual concrete values. In the simulation, these symbolic values propagate alongside the constant values, and in each logic gate, they may be combined with each other or one of the constants to result in either a logical expression on the symbolic variables, represented by an expression graph, or a constant. See Figure 1 for an example.

In a bit-level symbolic simulator a single symbolic variable *a* corresponds to the set of Boolean values containing both 0 and 1. If stimulus to a symbolic simulation refers to the variables *a*, *b* and *c*, the internal signals might carry values like *a*∧*b* or *a*∨(*b*∧¬*c*). Usual logic rules apply: if the inputs to an AND-gate are *a* and 1, the output will be *a*, if the input to a NOT-gate is *b*, the output will be ¬*b*, and if the inputs to an AND-gate are *a* and *b*, the output is the logical expression *a*∧*b*. In symbolic simulation, a specific symbolic variable is associated with a specific signal and time in the stimulus. Associating a variable with a signal at a time does not fix the value, but instead gives a name that can be used to refer to the value.
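The gate rules just described can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the representation used by any production simulator: expressions are plain Python dataclasses (`Var`, `Not`, `And` are names introduced here), and the only simplification performed is constant folding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str            # the name tying the value to a signal and time

@dataclass(frozen=True)
class Not:
    arg: object

@dataclass(frozen=True)
class And:
    left: object
    right: object

def and_gate(p, q):
    """AND over constants 0/1 and symbolic expressions, folding constants."""
    if p == 0 or q == 0:
        return 0
    if p == 1:
        return q
    if q == 1:
        return p
    return And(p, q)

def not_gate(p):
    """NOT over constants 0/1 and symbolic expressions."""
    if p in (0, 1):
        return 1 - p
    return Not(p)

a, b = Var("a"), Var("b")
print(and_gate(a, 1))    # the variable a itself
print(not_gate(b))       # the expression NOT b
print(and_gate(a, b))    # the expression a AND b
```

Because every leaf of an expression is a named variable, the origin of any value computed in the simulation can be read off from the expression that represents it.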

In symbolic simulation, the constant value *X* is used to denote a universal undefined or unknown value, which propagates according to rules depicted in Figure 2. The value *X* denotes lack of information: we do not know whether the value is 0 or 1. The propagation rules reflect this intuition. Symbolic simulation uses *X*'s as an abstraction mechanism: unlike symbolic variables, *X*'s are an over-approximation of Boolean circuit behavior. Both symbolic variables and *X*'s allow us to verify a property over a single symbolic trace, and conclude that it is valid over every possible trace instantiating the *X*'s and the symbolic variables with 0's or 1's.
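As an illustration of *X* as an over-approximation, the following sketch implements the standard three-valued rules for AND, OR and NOT. We assume these match the rules of Figure 2, which is not reproduced here; the function names are ours.

```python
X = "X"   # the undefined value

def and3(p, q):
    """Three-valued AND: a 0 on either input decides the output."""
    if p == 0 or q == 0:
        return 0
    if p == 1 and q == 1:
        return 1
    return X

def or3(p, q):
    """Three-valued OR: a 1 on either input decides the output."""
    if p == 1 or q == 1:
        return 1
    if p == 0 and q == 0:
        return 0
    return X

def not3(p):
    """Three-valued NOT: an unknown input stays unknown."""
    return X if p == X else 1 - p

print(and3(0, X), and3(1, X), or3(1, X), not3(X))
print(and3(X, not3(X)))
```

The last line shows the loss of precision: for a signal *p* driven with *X*, the circuit *p*∧¬*p* evaluates to *X*, although it is 0 for every concrete value of *p*. Driving *p* with a symbolic variable instead keeps the relationship between the two gate inputs, at the cost of a larger expression.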

Figure 3 depicts a simplified pipelined ALU circuit with a 16-bit wide two-cycle data-path from sources to writeback. Figure 4 depicts a typical symbolic trace that might be used in the verification of this ALU, focusing on a single instance of an eight-bit wide bitwise OR micro-operation. The control signals are driven with concrete values corresponding to the operation, and the source data is driven with symbolic variables *a*[15],...,*a*[0] and *b*[15],...,*b*[0] in the one cycle in which the operation is issued. In all other cycles these signals are driven with the undefined value *X* (gray waveform). In the simulation, the values of the write-back data and zero flag two cycles later are then expressions on the symbolic variables associated with the source data.

A single symbolic simulation trace corresponds to a set of ordinary simulation traces, covering behaviors of the simulated circuit for all the possible instantiations of the symbolic variables with concrete values. The ability to cover all behaviors forms the basis of using symbolic simulation as a formal verification method. In this role symbolic simulation excels in verification of deep targeted properties of fixed-length pipelines, typically of the transactional form *stimulus A at time t is followed by response B at time t* + *n*. It has a unique ability to carve out the circuit logic relevant to the progression of a pipeline while ignoring the rest of the circuit and other transactions in flight. As the approach is conceptually simple and concrete, it gives the human verifier a fine-grained visibility into the progress of the computation during a verification task, enabling precise analysis and mitigation of computational complexity bottlenecks. Because of these advantages, symbolic simulation can routinely handle circuits that are orders of magnitude above the capacity of more traditional formal property verification approaches, as well as circuits where the pipelines are too enmeshed to be amenable to equivalence-based verification methods.

#### *B. Execution Cluster*

Intel Core processor architecture has evolved gradually over the years. Typically, a new design project maintains functional backwards compatibility with earlier designs while providing improvements along different axes: new instructions and capabilities, improved performance or power, or design adjustments to meet side conditions set by a new manufacturing process. A design project routinely inherits components from earlier designs.

At a high level, a single core consists of a set of major design components called *clusters*. The front-end cluster fetches and decodes architectural instructions, translates them to micro-operations and computes branch predictions. The out-of-order cluster receives streams of micro-operations from the front end, keeps track of dependencies between them, schedules ready-to-execute micro-operations for execution, takes care of branch misprediction and event recovery, retires completed instructions, and updates architectural state. The execution cluster carries out data computations for all micro-operations implemented by the design, performs memory address calculations, and determines and signals branch mispredictions. The memory cluster handles memory accesses, may contain first level caches and interfaces with a system-on-chip layer outside the core, including for example a graphics processing unit and a memory controller. The SystemVerilog source code of a cluster usually contains several hundred thousand lines of code. While not a physical entity like the above, microcode is also a major design component, the complexity of which is comparable to that of the clusters.

In this paper we focus on security validation of the execution cluster (EXE) on an Intel Core processor design. The EXE cluster consists of six main units: the integer execution unit (IEU) contains logic for plain integer and miscellaneous other operations, the single instruction multiple data (SIMD) integer unit (SIU) contains logic for packed integer operations, the floating-point unit (FPU) implements plain and packed floating-point operations such as DIV, MUL, ADD, etc., the address generation unit (AGU) performs address calculations and access checks for memory accesses, the jump execution unit (JEU) implements jump operations and determines and signals branch mispredictions, and the memory interface unit (MIU) receives load data from and passes store data to the memory cluster, maintains store forwarding buffers, performs various datatype conversions, and takes care of data bypassing. In a typical contemporary Intel Core processor design, the EXE cluster implements over 5000 distinct micro-operations and supports multi-threading.

At an abstract level, the EXE cluster is a pipelined machine, receiving as input streams of micro-operations (micro-ops, uops) through a set of schedule ports. Each micro-operation receives its source data either through the cluster interface or through a bypass from a previous operation, and produces its result through a write-back port after an operation-dependent latency. The cluster has state components, which a micro-operation may read or update synchronously.

#### *C. EXE Formal Verification*

Formal verification of arithmetic data-paths has been a focus area at Intel ever since the Pentium® FDIV bug in 1994. The primary vehicle for this work is symbolic simulation, incorporated in Intel's in-house Forte verification toolset under the name of Symbolic Trajectory Evaluation (STE) [7]. Initially a research initiative during the Pentium Pro design cycle, Formal Verification has been carried out as a routine part of Intel processor development projects since Pentium 4 in 1999. All Intel Core processor EXE data-paths since 2005, as well as most Intel Atom® processor and Gen Graphics arithmetic engines have been formally verified using symbolic simulation [8], [9].

In concrete terms, EXE formal verification is carried out through a shared verification system called Cluster Verification Environment (CVE), a large software artifact that creates a standard, uniform methodology for writing specifications and carrying out verification tasks [8]. Underlying CVE is the Forte/reFLect toolset, consisting of the high performance simulator STE wrapped in a full-fledged functional programming language [7]. All verification takes place at the level of the full cluster, not the underlying individual units.

In verification of the EXE cluster, every micro-operation and every port on which the micro-operation can execute correspond to a separate symbolic simulation task. This simulation starts from a totally unconstrained initial state and focuses on one instance of the micro-operation under verification. The control signals that are relevant to the micro-operation are restricted according to the micro-operation, and the source data signals are driven with symbolic variables, as in the simplified example in Figure 4. Additionally, some internal and external control signals of the circuit are driven with symbolic variables and may be restricted using control invariants that are used to capture reachable state restrictions. Due to the unconstrained initial state of the simulation, such reachable state restrictions are not automatically accounted for in the verification and need to be manually formulated and separately verified. All other signals in the simulation are driven with the undefined value *X*. Altogether, in this setup the single instance of the micro-operation under verification in the single symbolic trace covers all possible invocations of the micro-operation in any legal trace of the circuit.

Effectively, in the verification setup for a single micro-operation the control signals are set to fix the data-path controls to match a single instance of that micro-operation, and symbolic variables on the data are used to exhaustively simulate the data-path instance. The simulation is then connected to an abstract functional reference model for the micro-operation through source and write-back mappings, and the outputs of the design and the reference model are compared. These design-dependent mappings extract the intended source and result values for the micro-operation at the relevant times relative to the instance we are verifying.

For a large majority of micro-operations in the EXE cluster, the data-path can be exhaustively symbolically simulated in one pass at the full cluster level. For certain complex operations like floating-point addition, careful case splits on the data space are needed to contain symbolic expression growth in the simulation, and for the most complex operations like floating-point divide or fused multiply-add, a sequential decomposition strategy is applied.

#### III. EXE SECURITY VERIFICATION

#### *A. EXE and Data Security*

Traditionally EXE validation has focused on the functional correctness of the micro-operations, including the validation of control logic required for non-interference from other operations simultaneously in flight. Since the Spectre and Meltdown vulnerabilities, security validation has become a greater focus area. In both exploits, a rogue process can theoretically gain access to privileged data by observing the side effects of speculative, although ultimately unsuccessful access to a memory location containing the secret. A key ingredient of these exploits is that secret data temporarily propagates and influences execution flows in the micro-architectural level, although the results of the computations on the secret data are appropriately squashed before they become architecturally visible. In the classic functional correctness sense this is not a problem, as the secret data is never directly exposed. However, in the exploits a rogue process tracks the ways in which the secret data has influenced the execution flows, especially through timing analysis, in an effort to statistically deduce the secret with a high probability. This means that we need to secure the propagation of secret data also at the microarchitectural level. As it is difficult to foresee all the ways in which the secrets' influences on execution could be exploited, the best strategy is to try to limit the propagation of secrets in the system as best as we can, and try to block any leakages at a local level as early as possible.

Looking at the EXE cluster from the security and data leakage perspective, the first thing to note is that in the larger context some micro-operations may be privileged, and some may not, some data may be secret, and some may not, but EXE has no awareness of that. All it sees are micro-operations and data. Privileged and less privileged operations are interleaved out-of-order in the same thread and between threads. The mixture of secret and non-secret makes it harder to formulate a property *Thou shalt not leak secrets*, as we don't have a good measure of what counts as a secret. However, each microoperation has a well-defined notion of the data it is expected to process: which buses at which times relative to the operation carry its source and result data. Relative to an operation, we can then over-approximate all other data as secret. This leads to the following fundamental security property for EXE:

*For every micro-operation executing in EXE, its result data should be exclusively a function of its source data.*

By 'result data' we mean the main write-back data bus, flags, faults, and all auxiliary outputs together. This security property can be formalized more accurately as:

*For every micro-operation u, there is a function spec(u) such that for every trace T of the circuit and every point t of T, if uop u is issued at point t of T and we write src for the source data of u and wb for the write-back data of u relative to the point t of T, then wb = spec(u)(src).*
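Written out with explicit quantifiers, the property can be read as follows. The predicate name $\mathit{issued}$ and the subscripted notation $\mathit{src}_{T,t}(u)$ and $\mathit{wb}_{T,t}(u)$, for the source and write-back data of $u$ relative to point $t$ of trace $T$, are notational assumptions introduced here for illustration only.

```latex
\forall u.\; \exists\, \mathit{spec}(u).\; \forall T.\; \forall t \in T.\;
  \mathit{issued}(u, T, t) \;\Longrightarrow\;
  \mathit{wb}_{T,t}(u) = \mathit{spec}(u)\bigl(\mathit{src}_{T,t}(u)\bigr)
```

The order of the quantifiers carries the security content: $\mathit{spec}(u)$ is fixed once per micro-operation, before the trace and the issue point are chosen, so it cannot depend on any state left behind by other operations.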

For many micro-operations, this security property follows automatically from functional correctness. If the specification for the operation is fully defined for all possible source values, and we have verified that the implementation fully agrees with the specification, there is simply no logical possibility for the result data not to be purely a function of the source data. However, many operations have partially undefined results, where some result components are unspecified either for all or some source values. For example, some floating-point micro-operations do not fully support all possible source values, reverting to microcode flows for rare or hard-to-implement cases, leaving the result data undefined. Similarly, certain helper operations that are used only in specific microcode flows in contexts where some parts of the result are never used may leave these result components undefined. Designs take advantage of the undefined spaces, as they allow an implementation to be optimized without a need to maintain identical behavior in the undefined space. These undefined spaces provide an opportunity for a micro-operation to write back values that are derived from some other data than its sources, including possibly secret data that has been or is being processed by other micro-operations.

The most common scenario of data leakage in undefined spaces is when secret data processed by an earlier micro-operation lingers in some internal flops of EXE and is passed to the write-back bus as a later micro-operation's undefined result. In a fully pipelined machine where all clocks toggle all the time, this scenario cannot happen, as secret data stays in any pipe-stage for exactly the one cycle when it is being processed before being overwritten by the next wave of values. However, such always-toggling designs are a thing of the past. Qualified clocks are ubiquitous, and their use increases and becomes more fine-grained with every design generation because of power considerations. In many data-paths the clocks toggle at most once for each operation. This means that any secret data processed by an operation remains in internal flops in every pipe-stage, until the next operation executing in the same data-path clears it. In this context the security property above can be viewed as setting a security perimeter around EXE. Secret data can linger on inside the cluster but cannot be exported through the write-back bus by any micro-operation.

The general concept of the analysis of data leakages through undefined behavior is directly relevant for the prevention of Meltdown-type vulnerabilities, although the areas primarily contributing to Meltdown are outside our focus area in EXE. An essential part of Meltdown is transient execution after a faulting load micro-operation from an out-of-bounds memory location containing secret data [10]. While the problematic load micro-operation produces a fault due to an access check violation, it may, under certain circumstances, nevertheless have read the secret value from the memory location and passed the value on to a subsequent flow that exposes the secret. The specification for a load micro-operation is likely to be of the form *if the load does not generate a fault, the writeback data will be the value held by the memory location pointed to by the sources, otherwise the writeback data is a don't-care*. Note that the naive specification, without the faulting condition and the don't-care space, is very unlikely to hold for any real implementation, as a load can fault for a variety of reasons, many of which prevent the routing of the memory data to the writeback. This undefined space in the specification allows the secret to be exposed, or conversely, as pointed out by Canella et al.: *"... merely replacing the data of a faulting instruction with a dummy value suffices to block Meltdown-type leakage in silicon ..."* [10, p. 252].

#### *B. EXE Security Analysis with Symbolic Simulation*

Considering the fundamental security property formulated above, an extremely useful feature of symbolic simulation is that every symbolic variable can be uniquely related to the signal and time it was associated with in the stimulus. Each 1 in stimulus looks exactly like any other 1, each 0 like any other 0, but every symbolic variable carries immediately in its name the notion of which signal and time it originated from. The uniqueness of names and the setup of EXE verification allows us to re-phrase the security property as:

*For every micro-operation executing in EXE, the symbolic expressions for its result data should only refer to symbolic variables associated with its source data, and should not allow the undefined value X.*

This property is relative to the symbolic simulation task for the micro-operation, as outlined in Section II-C. The symbolic re-formulation of the security property guarantees the original version since the single symbolic simulation for the micro-operation is an over-approximation of every possible invocation of the micro-operation in any trace. This means that we can simply read the function *spec*(*u*) required by the original definition, mapping source data to the result, from the symbolic expressions for the result data.
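The re-phrased property can be illustrated with a minimal sketch. The representation below is hypothetical and is not the CVE implementation: each toy symbolic bit records only the set of variable names it depends on and whether it may be `X`, and every gate is treated conservatively as depending on all of its inputs. The names `Sym`, `gate` and `check_result` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:
    """Toy symbolic bit: names it depends on, and whether it may be X."""
    support: frozenset = frozenset()
    may_be_x: bool = False


def var(name):
    """A fresh named symbolic variable."""
    return Sym(frozenset([name]))


X = Sym(may_be_x=True)  # the undefined value


def gate(a, b):
    """Any two-input gate, modelled conservatively: the output depends on
    everything either input depends on and may be X if either input may be."""
    return Sym(a.support | b.support, a.may_be_x or b.may_be_x)


def check_result(result_bits, source_names):
    """Return (bit index, offending name or 'X') for every violation."""
    problems = []
    for i, bit in enumerate(result_bits):
        if bit.may_be_x:
            problems.append((i, "X"))
        for name in sorted(bit.support - source_names):
            problems.append((i, name))
    return problems


# Two-bit OR with sources a and b.
a = [var("a[0]"), var("a[1]")]
b = [var("b[0]"), var("b[1]")]
sources = frozenset(n for bit in a + b for n in bit.support)

good = [gate(a[0], b[0]), gate(a[1], b[1])]
leaky = [gate(a[0], b[0]), var("stale[1]")]   # bit 1 comes from elsewhere
undefined = [gate(a[0], b[0]), X]             # bit 1 is an unnamed X

print(check_result(good, sources))
print(check_result(leaky, sources))
print(check_result(undefined, sources))
```

Note that the check never evaluates what the result should be; it only inspects which names occur in the result expressions.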

Another way of viewing the matter is that the symbolic expressions on the write-back signals fully capture all dependencies of the write-back on any signals in their fan-in cone. The constant values in the simulation do not matter in this respect. Since the symbolic simulation for the micro-operation over-approximates every possible invocation of it in any trace, every constant value in the symbolic simulation is also present in all these invocations. Consequently, the propagation of such constants in the simulation to the write-back cannot disclose anything about the internal state of the circuit that would not be universally true. As a technical restriction, in our work all case splits and decompositions used to alleviate verification complexity are on data and not on control signals and will not turn any symbolic variables on control signals to constants.

Notice that the symbolic formulation of the security property is not a property about the value of the result data itself. Instead, it is a property about the symbolic expression used to represent the value of the result data in the simulation, and the symbolic names that occur in that expression. Because it talks about names, not values, it is not something that could be coded in methods that describe properties of signal values, such as SystemVerilog Assertions.

When we run a micro-operation that has a fully specified result data, we naturally verify that it writes exactly the data we expect it to and nothing else, as otherwise the verification would fail. However, when there is an undefined space in the output, the situation is trickier because we don't know what value to expect. The use of named variables allows us to verify that the result data is a function of the source data without the need to say what that function *spec*(*u*) is, i.e. without needing to specify the expected result value. This is very efficient when we are looking at the undefined space, where typically there is no good definition of what the result should be.

#### *C. Implementation*

Next, we describe in detail how this idea was implemented. At a high level, named variables allow us to:


The security analysis has two outcomes. First, we can detect security vulnerabilities where they exist. Second, the absence of detected vulnerabilities for the vast majority of micro-operations provides strong evidence that no secrets can be leaked to the interface of the cluster through those operations.

Data propagation in the circuit is often gated by specific operations that exclusively enable the data flow. If that enabling is too short, and there is no mechanism that clears the data after the operation, it can hang there. Stale data becomes a security risk when another operation can read this data. In early stages of verification environment development for a new project, the validation focuses on pure data-path verification in a sterile environment, and as a simplification, disables power gating and lets clocks toggle freely. At this stage all data flows uninterrupted, and we cannot guarantee there are no leakages coming from stale data on a power-gated bus. Security verification analysis becomes effective and meaningful only when we enable all power optimizations in the formal environment. At the time we started this security initiative, this pre-condition was met in almost all areas of the design we were working on.

Formal verification of arithmetic data-paths in the EXE cluster is fully covered in CVE using symbolic simulation. We have specifications for all existing micro-operations and the infrastructure to run a full regression to collect any information needed for the extra layer of security check. This provided a solid base for our analysis, and an efficient process that led to interesting results in a short time. The process can be divided into three stages.

	- Each uop in CVE has a defined data type signature, which specifies useful static information about the shape of the sources and result of the uop, such as data size, data type (integer, floating-point), signed/unsigned etc. The source or write-back data can be of NULL type, meaning it is not used by the uop. For NULL write-back, the checkers will not sample the write-back bus at all in a simulation.
	- A uop may have a defined write-back datatype, but its specification may explicitly encode a don't-care space. For example, the data output of a divide operation could be defined as a don't-care when the divisor is zero. In this case the checkers will sample the output in a simulation but will ignore the value for the functional correctness check. In the eight-bit OR example, we could sample the full 16 bit writeback bus, but not necessarily check the upper eight bits, leaving them explicitly undefined.

For both methods the existing CVE data structures allowed us to easily identify the set of uops that produce undefined results, creating a clear goal for the main security analysis. The first step in enabling the security check was to switch from the first method to the second one for all uops, to make sure we always sample the write-back bus: identify the uops using the first method, convert the NULL data signatures to a meaningful type, and incorporate the explicit don't-care space into the functional specification.
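The role of the explicit don't-care space can be sketched as follows, using the divide example above. This is a minimal illustration under assumed names (`DONT_CARE`, `div_spec`, `leaky_div`, `functional_check`), not the CVE checker: the output is always sampled, but the functional check ignores it wherever the specification is undefined.

```python
DONT_CARE = None  # marker for "the specification does not constrain this result"


def div_spec(a, b):
    """Hypothetical specification: the quotient is undefined for a zero divisor."""
    return DONT_CARE if b == 0 else a // b


def leaky_div(a, b, stale=0xBEEF):
    """Hypothetical implementation that writes back a stale value in the
    don't-care space instead of anything derived from its sources."""
    return stale if b == 0 else a // b


def functional_check(spec, impl, a, b):
    """Sample the output, but accept any value where the spec is a don't-care."""
    expected = spec(a, b)
    actual = impl(a, b)  # the write-back is sampled in every case
    return expected is DONT_CARE or actual == expected


print(functional_check(div_spec, leaky_div, 6, 3))
print(functional_check(div_spec, leaky_div, 6, 0))
```

Both checks pass, although the second write-back is unrelated to the sources. This is why sampling the bus is a precondition for the security check but does not replace it.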

2) *Sample results and detect unexpected variables.*

This stage is the heart of the process, using the existing symbolic simulation capability in the two steps above: A) Sample the output and extract the list of variables in the symbolic expression, and B) Identify suspicious variable names in the list. The ingredients of this stage are:


above. As sampled by the operating uop, they are considered safe.


Given the values in the write-back bus, we check for *X*'s and query the variable dependency list for suspicious names. In the eight-bit OR example of Figure 4, there are no *X* values, and the dependency list includes only 'good' names such as *a*[7] or *b*[0].

This check is fully automated, as the classification of variable names to good vs suspicious ones can be done mechanically based on existing information about the intended uop source interfaces and variable naming conventions.
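A mechanical classification of this kind might look like the sketch below. It assumes variable names of the form `signal[bit]@time`, as in the "Src1[15]@23" example used later in this section; the table of expected source windows, the sample time 25 and the function names are hypothetical.

```python
import re

# Assumed naming convention: signal[bit]@time, e.g. "Src1[15]@23".
NAME_RE = re.compile(r"(\w+)\[(\d+)\]@(-?\d+)")


def parse_name(name):
    """Split a variable name into (signal, bit index, time)."""
    m = NAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unrecognized variable name: {name!r}")
    signal, bit, time = m.groups()
    return signal, int(bit), int(time)


def classify(names, expected):
    """Split names into (good, suspicious).

    `expected` maps each source signal of the uop to (sample_time, width):
    the time at which the uop samples it and how many low bits it uses.
    """
    good, suspicious = [], []
    for name in names:
        signal, bit, time = parse_name(name)
        window = expected.get(signal)
        if window is not None and time == window[0] and bit < window[1]:
            good.append(name)
        else:
            suspicious.append(name)
    return good, suspicious


# Hypothetical eight-bit uop sampling the low bytes of Src1 and Src2 at time 25.
expected = {"Src1": (25, 8), "Src2": (25, 8)}
deps = ["Src1[7]@25", "Src2[0]@25", "Src1[15]@23"]
good, suspicious = classify(deps, expected)
print(good, suspicious)
```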

3) *Trace the suspicious variables.*

The presence of the undefined value *X* or a suspicious name in the dependency list does not yet automatically mean that what we see is real data leakage. By methodology, symbolic simulation uses a maximally uninitialized start state for the simulation, with all signals having the value *X*, and uses stimulus that drives *X*'s on most inputs to the circuit, over-approximating the real legal behaviors of the circuit. We need to trace the suspicious variable or *X*, see how it propagated to the write-back, and understand whether the path to the write-back is possible in the real operating environment of the circuit. This stage is like the debug process of any simulation, tracing the origin of a value in the circuit. We use a schematic viewer that shows symbolic values and trace the ones that we find interesting. In some cases, to better analyze a behavior, we strengthen the simulation to drive a variable at an internal signal that used to hold an unnamed *X* that may propagate to the write-back.

Consider for example the simplified ALU of Figure 3 and assume that the circuit is augmented with power gating logic that turns off clocks for the high eight bits [15:8] of the datapath for operations that only operate on the low eight bits [7:0] of data. If we now simulate an eight-bit OR operation on the circuit as in Figure 5, we might observe *X* values in bits [15:8] of the write-back as in Figure 6, instead of the 'good' result of Figure 4. Tracing back the *X* values on the write-back, we would find an internal flop with the output *X* and a clock that does not toggle, as in Figure 7. In the circuit, this flop will hold any value the previous operation has left there, presenting a leakage risk. To check whether this data really propagates to the output, we want to track a concrete named variable. To do this, we drive unique named variables "Src1[15]@23" ... "Src1[8]@23" to the internal flop as in Figure 8, and observe these variables in the write-back, as in Figure 9. Once we understand the leakage mechanism, we can then manually generate a concrete example exhibiting both an earlier uop leaving behind stale data, and a later uop that leaks the stale data to the write-back bus, as in Figure 10. In this example, the high eight data bits of a 16-bit uop A remain in the internal state until they are overwritten by the next 16-bit uop C, and are exposed by the 8-bit uop B in the meantime.

Fig. 5. Clock gating for eight-bit operations

Fig. 6. Sampled *X* on the write-back bus

Fig. 8. Replace the *X* with a named variable

Fig. 9. Symbolic waveform with data leak

Fig. 10. Concrete waveform with data leak
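The concrete scenario with uops A, B and C can be mimicked in a few lines. The model below is a deliberately simplified assumption, not the circuit of the figures: a 16-bit OR datapath whose high-byte source flops are clocked only by 16-bit uops. The class name `GatedAlu`, the `zero_unused` option and all operand values are hypothetical.

```python
class GatedAlu:
    """Toy 16-bit OR datapath with clock gating on the high byte.

    For eight-bit uops the high-byte source flops are not clocked, so they
    keep whatever the previous 16-bit uop left behind.
    """

    def __init__(self, zero_unused=False):
        self.src1 = 0
        self.src2 = 0
        self.zero_unused = zero_unused  # assumed fix: define the unused bits

    def execute(self, width, a, b):
        if width == 16:
            # All source flops are clocked.
            self.src1, self.src2 = a & 0xFFFF, b & 0xFFFF
        else:
            # Only the low byte is latched; the high byte holds stale data.
            self.src1 = (self.src1 & 0xFF00) | (a & 0xFF)
            self.src2 = (self.src2 & 0xFF00) | (b & 0xFF)
        wb = self.src1 | self.src2  # full 16-bit write-back bus
        if width == 8 and self.zero_unused:
            wb &= 0x00FF
        return wb


alu = GatedAlu()
wb_a = alu.execute(16, 0xAB00, 0x0012)  # uop A leaves 0xAB in the high flops
wb_b = alu.execute(8, 0x01, 0x02)       # uop B exposes the stale high byte
wb_c = alu.execute(16, 0x0100, 0x0001)  # uop C overwrites the stale data
print(hex(wb_a), hex(wb_b), hex(wb_c))

fixed = GatedAlu(zero_unused=True)
fixed.execute(16, 0xAB00, 0x0012)
print(hex(fixed.execute(8, 0x01, 0x02)))
```

In the unfixed model the low byte of uop B's write-back is functionally correct, while its high byte repeats data belonging to uop A; with the unused bits forced to zero, the stale data stays inside the datapath.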

#### IV. RESULTS

The flow of security verification was implemented as an automated extra check on top of the traditional data-path symbolic simulation. The process leveraged the existing capabilities of CVE that already supported all EXE uops. This gave us the ability to run a full regression and get first results quickly.

We chose to focus on the write-back data interface buses and concentrated on the roughly 2000 uops for which these buses are relevant, out of about 5000 legal uops for the cluster in total. Among these uops we first identified the ones that have fully or partially unspecified write-back data. Our analysis showed that 89.4% of the uops were completely specified, and 10.6% had unspecified write-back data. We then further analyzed the uops with unspecified write-back data by symbolic dependency analysis and found that 97.8% of uops were either completely specified or exhibited no unexpected data at write-back, whereas 2.2% of the uops had an undefined result space and failed the dependency analysis.

For the 97.8% of the uops that passed our analysis, we provided strong evidence that there is no risk of data leakage, as our analysis took place in the formal framework covering all possible behaviors. Note also that the dependency analysis allowed us to reduce the ratio of suspicious uops from 10.6% to 2.2%. As a restriction in scope, we did not look at data leakages in the bypass network, although the method would be equally applicable there.
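The relation between the reported percentages can be made explicit. The only derived quantities below are the share of uops that had unspecified write-back data yet passed the dependency analysis, and the resulting reduction factor; both follow directly from the figures stated above.

```latex
\underbrace{89.4\%}_{\text{completely specified}}
  + \underbrace{(10.6\% - 2.2\%)}_{\text{unspecified, but passed}}
  = 97.8\%,
\qquad
\frac{10.6\%}{2.2\%} \approx 4.8
```

In other words, the dependency analysis cleared about four out of five of the uops that functional specifications alone left open.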

The first real local EXE potential data leakage was discovered in less than a month. In a total effort of about two months of work, we discovered several different potential leakage mechanisms, all previously unknown. The failures were analyzed and grouped to RTL bugs with a common cause. Examples of potential leakage mechanisms include:


3) Most uops that write only part of the write-back bus, for example 32 bits out of 128, have a clear definition of the unused bits, and we sample them along with the computed result of the lower part in regular datapath verification. In one exception, the upper part for a specific uop D was left unspecified. Tracing back the write-back, we reached an internal source bus shared by several operations, with a clock toggling just once per uop, causing the data to hang. Usually, the next uop would clear the bus. Uop D did not, leaking the upper bits of the source data left behind by the previous uop.

These bugs were all reproduced in normal simulation. They did not cause a functional failure: the results are never checked since they fall into the don't-care space of the specification. However, it was clear that the value written to the write-back is exactly the value left behind by a previous uop.

After the detection of these kinds of potential data leaks, there are several options for actions to fix them. The straightforward solution is to modify the currently undefined uop to have a defined value, e.g. write zeroes to the write-back data. This will be the easiest to verify because it will become again a strongly defined data-path verification task. It will also be the strongest solution, as it truly closes the leak. Another solution is to clear the stale data left by the earlier uop, for example by opening the gating clock for an extra cycle. Both options close the leak at the EXE boundary but require changing the design and could cost power or area.

If it is not possible to fix the design, another option is at the microcode level, making sure the undefined operation is not used in any way it could be exploited. Effectively here one establishes a security perimeter with a larger scope than EXE to see that the compromised data is contained before it becomes visible through a vulnerability at a higher level. This method is less desirable than the ones above, as the analysis scope is larger, outside the scope of existing formal tools, and relies more on finding parallels with known vulnerabilities, while new ways of exploiting information leaked out of the cluster may emerge. Also, microcode implementation is dynamic, and it is possible that changes to the usage model that is safe today may make it unsafe tomorrow.

The potential local data leakages discovered by our analysis were addressed during the design project and as a result do not lead to a security violation at a user visible level in the final product.

#### V. SUMMARY

Symbolic simulation's special trait — the usage of named variables — makes it a productive method to analyze data leakage risks. The scope of this work was huge for any formal analysis: a whole cluster, thousands of operations, and hundreds of thousands of flops in the circuit. Out of those, without having any prior knowledge where to look for the risks, we hit the relatively few instances that mattered in a short time. We found real issues, in a live project, issues that were not detected by any other method.

In this paper we described how we leveraged the existing environment of CVE that already supports the thousands of specifications in EXE cluster, holds information about data types and has a clear naming convention. This made the process efficient and demonstrated the importance of the complete verification environment covering EXE data-path. It is also important to clarify that the general concept we describe here is not dependent on it. Security verification by symbolic simulation can be implemented in various designs, where we do not have such infrastructure to rely on. Symbolic simulation is the key in analyzing data leakage risks of this kind, not the formal environment in itself.

In future design projects, with the increasing demand for security validation, we hope to explore where we can further develop this usage of symbolic simulation.

#### ACKNOWLEDGEMENTS

We would like to thank Arkady Neyshtadt for his security analysis, Gilad Holzstein, Robert Jones, Alex Levin, Yoav Moratt and Nir Shildan for discussions on security, Annette Upton for detailed feedback on the paper, and David Turner, Yaniv Dana and Alon Flaisher for the opportunity to carry out this work.

#### REFERENCES


# Scaling Up Hardware Accelerator Verification using A-QED with Functional Decomposition

Saranyu Chattopadhyay<sup>∗</sup>, Florian Lonsing<sup>∗</sup>, Luca Piccolboni<sup>†</sup>, Deepraj Soni<sup>¶</sup>, Peng Wei<sup>§</sup>, Xiaofan Zhang<sup>‖</sup>, Yuan Zhou<sup>‡</sup>, Luca Carloni<sup>†</sup>, Deming Chen<sup>‖</sup>, Jason Cong<sup>§</sup>, Ramesh Karri<sup>¶</sup>, Zhiru Zhang<sup>‡</sup>, Caroline Trippel<sup>∗</sup>, Clark Barrett<sup>∗</sup>, Subhasish Mitra<sup>∗</sup>

<sup>∗</sup>Stanford University, <sup>†</sup>Columbia University, <sup>‡</sup>Cornell University, <sup>§</sup>University of California, Los Angeles, <sup>¶</sup>New York University, <sup>‖</sup>University of Illinois, Urbana-Champaign

*Abstract*—Hardware accelerators (*HAs*) are essential building blocks for fast and energy-efficient computing systems. *Accelerator Quick Error Detection (A-QED)* is a recent formal technique which uses Bounded Model Checking for pre-silicon verification of HAs. A-QED checks an HA for *self-consistency*, i.e., whether identical inputs within a sequence of operations always produce the same output. Under modest assumptions, A-QED is both sound and complete. However, as is well-known, large design sizes significantly limit the scalability of formal verification, including A-QED. We overcome this scalability challenge through a new decomposition technique for A-QED, called *A-QED with Decomposition (A-QED<sup>2</sup>)*. A-QED<sup>2</sup> systematically decomposes an HA into smaller, functional sub-modules, called *sub-accelerators*, which are then verified independently using A-QED. We prove completeness of A-QED<sup>2</sup>; in particular, if the full HA under verification contains a bug, then A-QED<sup>2</sup> ensures detection of that bug during A-QED verification of the corresponding sub-accelerators. Results on over 100 (buggy) versions of a wide variety of HAs with millions of logic gates demonstrate the effectiveness and practicality of A-QED<sup>2</sup>.

### I. INTRODUCTION

Hardware accelerators (*HAs*) are critical building blocks of energy-efficient System-on-Chip (*SoC*) platforms [1]–[3]. Unlike general-purpose processors, HAs implement a set of domain-specific functions (e.g., encryption, 3D Rendering, deep learning inference), referred to as *actions* in this paper, for improved energy and throughput. Today's SoCs integrate dozens of diverse HAs (e.g., 40+ HAs in Apple's A12 mobile SoC [4]).

Unfortunately, the energy and throughput improvements enabled by HAs come at the cost of increased design complexity. Ensuring that a given SoC will behave correctly and reliably requires verifying each and every constituent HA. Furthermore, HAs must achieve short design-to-deployment timelines in order to meet the needs of a wide variety of evolving applications [5]. Using conventional formal verification techniques to verify HAs faces several key challenges. Manually crafting extensive design-specific formal properties or full abstract functional specifications can be time-consuming and error-prone [6], [7]. Moreover, scaling verification to large HAs (with millions of logic gates) is difficult or even infeasible using off-the-shelf formal tools.

A recent formal verification technique targeting HAs, *Accelerator-Quick Error Detection (A-QED)* [8], overcomes the first challenge above. A-QED is readily applicable to a popular class of HAs: *loosely-coupled accelerators* (*LCAs*) [9], [10] (i.e., HAs that are not integrated as part of a central processing unit (*CPU*), but via an SoC's network-on-chip or a bus) that are also *non-interfering*. Non-interfering HAs produce the same result for a given action independent of their context within a sequence of actions (not to be confused with combinational circuits). In other words, the state of the accelerator does not affect future computations, and each computation is independent from previous computations. In contrast, computations of *interfering* HAs depend on state that is the result of previous computations. A-QED uses Bounded Model Checking (*BMC*) [11] to symbolically check sequences of actions for *self-consistency*. Specifically, it checks for *functional consistency (FC)*, the property that identical inputs within a sequence of operations always produce the same outputs. It was shown that FC checks, together with *response bound (RB)* checks and *single-action correctness (SAC)* checks, provide a thorough verification technique for non-interfering LCAs [8]. However, despite its success in discovering bugs in moderately-sized HA designs, A-QED suffers from the scalability challenges of formal tools. For example, A-QED (backed by off-the-shelf formal verification tools) times out after 12 hours when run on NVDLA, NVIDIA's deep-learning HA [12] with approximately 16 million logic gates.
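The FC property can be illustrated on toy models. The sketch below is not A-QED: it replaces symbolic Bounded Model Checking with brute-force enumeration of all input sequences up to a small bound, and the two accelerator models and all function names are hypothetical.

```python
from itertools import product


def fc_violation(make_accelerator, inputs, bound):
    """Search all input sequences of length 2..bound for two identical inputs
    that yield different outputs. Returns the offending sequence or None."""
    for length in range(2, bound + 1):
        for seq in product(inputs, repeat=length):
            acc = make_accelerator()  # fresh accelerator, as after reset
            seen = {}
            for x in seq:
                y = acc(x)
                if x in seen and seen[x] != y:
                    return seq
                seen[x] = y
    return None


def make_good():
    """Non-interfering model: the output depends on the current input only."""
    return lambda x: (3 * x + 1) % 16


def make_buggy():
    """Buggy model: the previous input leaks into the current output."""
    state = {"prev": 0}

    def step(x):
        y = (3 * x + 1 + state["prev"]) % 16
        state["prev"] = x
        return y

    return step


print(fc_violation(make_good, [0, 1, 2], 3))
print(fc_violation(make_buggy, [0, 1, 2], 3))
```

No specification of the expected output is needed: the buggy model is caught only because the same input is answered differently within one sequence.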

In this paper, we present a new verification approach called *A-QED with Decomposition (A-QED<sup>2</sup>)* to address the scalability challenge. First, we introduce a new, more general formal model of HA execution, which captures both interfering and non-interfering LCAs. We then show how A-QED<sup>2</sup> can *decompose* a large LCA into smaller *sub-accelerators* in such a way that both FC and RB checks can be directly applied to the sub-accelerators. Unlike conventional verification approaches based on decomposition, no new properties need to be devised to apply FC and RB to the decomposed sub-accelerators. Existing decomposition approaches can be leveraged to additionally check SAC of the sub-accelerators. A-QED<sup>2</sup> is complementary to verification approaches that rely on design abstraction, which can be used to further improve scalability and to simplify the effort required for SAC checks on decomposed sub-accelerators.

This paper presents both a formal foundation of A-QED<sup>2</sup> and an empirical evaluation that demonstrates its bug-finding capabilities in practice. We prove that A-QED's completeness guarantees [8] continue to hold for A-QED<sup>2</sup>: if the full HA under verification contains a bug, then A-QED<sup>2</sup> will detect that bug. Furthermore, we apply A-QED<sup>2</sup> to a wide variety of non-interfering LCAs (although our theoretical proofs apply to interfering LCAs as well): 109 different (buggy) versions of large open-source HAs of up to 200 million logic gates (including industrial HAs). Our empirical results focus on designs which are described in a high-level language (e.g., C/C++) and then translated to Register-Transfer-Level (*RTL*) designs (e.g., Verilog) using High-Level Synthesis (*HLS*) flows, where appropriate optimizations like pipelining and parallelism are instantiated. Such HLS-based HA design flows are becoming increasingly common in industry. However, A-QED<sup>2</sup> is not restricted to these specific HA design styles. Our empirical results show:


The rest of this paper is organized as follows. Sec. II presents related work. Sec. III presents a formal model of the accelerators targeted by A-QED<sup>2</sup> and our decomposition technique. Sec. IV details the A-QED<sup>2</sup> algorithms. Results are presented in Sec. V, and Sec. VI concludes.

### II. RELATED WORK

Conventional formal HA verification, e.g., [13]–[16], requires a specification, typically in the form of manually written, design-specific properties. These are then combined with a formal model of the design and handed to a formal tool, which attempts to prove the properties or find counter-examples. For the verification of latency-insensitive designs, an approach was developed to automatically derive and check properties from the RTL synthesized in HLS flows [17]. However, these derived properties are targeted at specific types of bugs.

Large design sizes have always been a challenge for formal techniques, and various approaches to this problem have been proposed. Among techniques to improve scalability are abstraction [18] and compositional reasoning (cf. [19]). The former removes details of the design, gaining scalability at the cost of possible false errors. Finding a scalable abstraction that does not generate false errors can be difficult and may be impossible in some cases. The latter uses *assume-guarantee* reasoning (e.g., [20]–[25]) and can be applied to decompose a large HA into smaller sub-modules. Importantly, the property p of the HA to be verified must also be decomposed into properties of the sub-modules. The properties of the sub-modules are verified individually under certain assumptions about the behavior of the other sub-modules. If all the properties of the sub-modules hold under the respective assumptions, then it can be concluded that p holds. However, finding the right properties for this decomposition can be very challenging.

Unlike for general compositional reasoning, the two main components of A-QED<sup>2</sup> (FC and RB) do not require decomposing properties. FC, in particular, leverages a universal *self-consistency* property. Self-consistency expresses the property that a design is expected to produce the same outputs whenever it is provided with the same inputs [26]. In A-QED<sup>2</sup>, self-consistency is checked independently for each sub-module (sub-accelerator in our case). Importantly, these aspects of A-QED<sup>2</sup> do not require complex assumptions about the behavior of the other sub-modules.

It is challenging to establish general *completeness guarantees* for conventional formal verification techniques [27]–[31], since completeness depends on the set of properties being checked. Designer-guided approaches [32], [33] require manual effort. Automatic generation of properties is usually incomplete and depends on abstract design descriptions [34] or models [35], or analysis of simulation traces [36], which may be difficult. In contrast, we have general completeness results for A-QED<sup>2</sup>.

A-QED<sup>2</sup> builds on A-QED [8] and leverages BMC [11], [37]. Similar approaches based on self-consistency have been successfully applied to other classes of hardware designs, such as processor verification (as *symbolic quick error detection (SQED)* [38]–[43]), as well as to hardware security [44]–[49].

#### III. FORMAL MODEL AND THEORETICAL RESULTS

In this section, we introduce a formal model for HAs, define functional consistency (*FC*), single-action correctness (*SAC*), and responsiveness for the model, and show how these properties provide correctness guarantees. We then define a notion of functional composition for our model and show how the above properties can be applied in a compositional way.

Our formal model differs from the one in previous work [8] in several important ways. It allows multiple inputs to be provided simultaneously by explicitly modeling the notion of *input batches*. The HAs we consider are *batch-mode accelerators* as they process input batches and produce output batches. Modeling batches is useful because it more closely matches the interfaces of real HAs. Moreover, input batches enable *intra-batch checks* for FC checking, as we describe below. With intra-batch checks, only one input batch is used for FC checking. Intra-batch checks are more restricted than general FC checks. However, they are easier to set up and run in practice, and they are highly effective at finding bugs, as we demonstrate empirically.
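An intra-batch check can be sketched as follows, under the assumption that an accelerator is modelled as a function from an input batch to an output batch of the same length. The two accelerator models and the helper name `intra_batch_fc` are hypothetical. The check uses a single batch and is only effective when that batch contains repeated inputs.

```python
def intra_batch_fc(accelerator, batch):
    """Within one batch, identical inputs must produce identical outputs.
    Returns the pair of batch positions that disagree, or None."""
    outputs = accelerator(batch)
    first_seen = {}
    for i, (x, y) in enumerate(zip(batch, outputs)):
        if x in first_seen and outputs[first_seen[x]] != y:
            return (first_seen[x], i)
        first_seen.setdefault(x, i)
    return None


def good_accelerator(batch):
    """Applies the same operation to every element of the batch."""
    return [x ^ 0x5A for x in batch]


def buggy_accelerator(batch):
    """Off-by-one loop bound: the last element is never processed."""
    out = list(batch)
    for i in range(len(batch) - 1):
        out[i] = batch[i] ^ 0x5A
    return out


batch = [7, 3, 7]  # positions 0 and 2 carry the same input
print(intra_batch_fc(good_accelerator, batch))
print(intra_batch_fc(buggy_accelerator, batch))
```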

Our model also explicitly separates control states and memory states. Control states represent control-flow information, e.g., program counters in HLS models of HAs. Memory states represent all other state-holding elements, e.g., program variables.

In our model we distinguish starting and ending control states in which inputs are provided and the computed outputs are ready, respectively. This makes the formulation simpler and is also a better match for HLS designs written in a high-level language, which is our main target in the experimental evaluation. Further, our model enables us to formulate the notion of *strong FC*, which leads to a complete approach to bug-finding with only two input batches.

In previous work [8], a ready-valid protocol was used to model input/output transactions in RTL designs. In contrast, our focus is on HLS designs. Finally, we distinguish so-called *relevant states*, which are parts of the state space that can affect output values. This makes it possible to model interfering as well as non-interfering HAs. In our experiments we focus on non-interfering HAs.

Before presenting formal definitions, we illustrate terminology informally with an example of a non-interfering batch-mode HA as shown in Listing 1 (a slightly modified excerpt of an HA implementing AES encryption [50]).

Function fun of the HA has two sub-accelerators, in lines 8-10 and 13-14, which are identified and verified by A-QED<sup>2</sup>. Each sub-accelerator applies a certain operation to all inputs in an input batch of the HA. In general, the *batch size* of an HA is the number of inputs in each batch, which is 256 for this HA. The first sub-accelerator ACC<sub>1</sub> processes an input batch provided via data and stores its output batch in buf. The second sub-accelerator ACC<sub>2</sub> takes its input batch from buf, where it also stores the output batch it produces. The control state of the HA is only implicitly represented by the program counter when executing function fun. Variables key and local\_key are global and determine the relevant state of the HA on which the result of the encryption operation depends. The HA is non-interfering because key and local\_key are left unchanged by ACC<sub>1</sub> and ACC<sub>2</sub>. Constants BS, UF, and US are used in HLS to configure the generated RTL.

Listing 1: HA Example (AES Encryption)

```
1  #define BS ((1) << 12) // BUF_SIZE
2  #define UF 2 // UNROLL_FACTOR
3  #define US BS / UF // UNROLL_SIZE
4
5  void fun(int data[BS], int buf[UF][US], int key[2]) {
6    int j, k;
7    // ===ACC1 START===
8    for (j = 0; j < UF; j++)
9      for (k = 0; k < BS / UF; k++)
10       buf[j][k] = *(data + i*BS + j*US + k) ^ key[0];
11   // ===ACC1 END===
12   // ===ACC2 START===
13   for (j = 0; j < UF; j++) {
14     aes256_encrypt(local_key[j], buf[j]); }
15   // ===ACC2 END===
16 }
```
Definition 1. *A* batch-mode hardware accelerator (HA) *is a finite state transition system [51], [52]* $Acc := (b, A, D, O, S, s_{c,I}, s_{c,F}, S_{m,I}, T)$*, where*

• $b \in \mathbb{N}$ *with* $b \ge 1$ *is the* batch size*,*

	- $S_{In} = (A \times D)^b$ *are the* input states*,*
	- $S_{Out} = O^b$ *are the* output states*,*
	- $S_R$ *are the* relevant states*, and*
	- $S_N$ *are the* non-relevant states*,*

When referring to different HAs, e.g., Acc<sub>0</sub> and Acc<sub>1</sub>, we use subscript notation to identify their components, e.g., $Acc_0 := (b_0, A_0, D_0, O_0, S_0, s_{c,I,0}, s_{c,F,0}, S_{m,I,0}, T_0)$.

We use $\mathbf{v} = \langle v_1, \dots, v_{|\mathbf{v}|} \rangle$ to denote a sequence with elements denoted $v_i$ and length $|\mathbf{v}|$. We concatenate sequences (and, for simplicity of notation, single elements with sequences) using '·', e.g., $\mathbf{v} = v_1 \cdot \mathbf{v}'$, where $\mathbf{v}' = \langle v_2, \dots, v_{|\mathbf{v}|} \rangle$. We will sometimes identify a sequence $\mathbf{v}$ with the corresponding tuple, and we write $v \in \mathbf{v}$ to denote that $v$ appears in $\mathbf{v}$. We denote the $i$-th element of a tuple $t$ as $t(i)$.

An HA Acc operates on a set $I^b$ of *input batches*, where $b$ is the *batch size* and $I = A \times D$. An input batch $in \in I^b$ has $b$ *batch elements*, each consisting of a pair $(a, d)$ containing an action $a \in A$ to be executed and data $d \in D$ (the data on which action $a$ operates).

A state $s \in S$ of Acc with $s = (s_c, s_m)$ consists of a control state $s_c \in S_C$ and a memory state $s_m \in S_M$. The control state $s_c$ represents control-flow-related state (e.g., the program counter in an execution of a high-level model of Acc). In a run of Acc, the control state starts at a distinguished initial state $s_{c,I}$ and ends at a distinguished final state $s_{c,F}$.

The memory state represents all other state-holding elements of Acc (including, e.g., global variables, local variables, function parameters, and memory elements). The memory state $s_m = (s_{in}, s_{out}, s_r, s_n)$ is divided into four parts. The first part, $s_{in} \in S_{In}$, contains the input to Acc. More precisely, in a run of Acc, the value of $s_{in}$ in the initial state is considered the input for that run. Similarly, at the end of a run of Acc, $s_{out} \in S_{Out}$ contains the outputs for that run (i.e., the values computed by Acc based on the inputs present at the start of the run).

The relevant state $s_r$ represents those state elements (other than $s_{in}$) that can influence the values of the outputs. Any part of the state that can affect the output value in at least one execution should be included in the relevant state. As an example of when this is needed, consider an encryption HA with actions for setting the encryption key and for encrypting data. The internal state that stores the key is part of the relevant state because it affects the way the output is computed from the input. The non-relevant state $s_n$ is everything else. We write $ctrl(s)$, $mem(s)$, $inp(s)$, $out(s)$, $rel(s)$, and $nrel(s)$ to denote the components $s_c$, $s_m$, $s_{in}$, $s_{out}$, $s_r$, and $s_n$, respectively. We overload the latter four operators to apply to memory states as well, and we lift the notation to sequences of states.

The set $S_I$ of initial states contains all states resulting from combining a memory state in $S_M$ with the unique initial control state $s_{c,I}$. The concrete initial states, $S_{CI}$, are a subset of $S_I$ and essentially represent the reset state(s) of the HA. They play a role in defining the *reachable* states (see Definition 3, below). The set $S_F$ of final states contains all states resulting from combining a memory state in $S_M$ with the unique final control state $s_{c,F}$. Finally, the transition function $T$ defines the successor state for any given state in $S$.

Given an input batch $in \in I^b$, the HA produces an *output batch* $o \in O^b$ as follows. Let $s_0 \in S_I$ be an initial state with $inp(s_0) = in$, and let $\mathbf{s} = T(s_0) = \langle s_1, \dots, s_k \rangle$ denote the sequence of $|\mathbf{s}| = k$ *successor states* generated by the *transition function* $T$, where $s_i = T(s_{i-1})$ for $1 \le i \le k$, such that $s_k \in S_F$ is a final state (and no earlier states in $\mathbf{s}$ are final states). We also assume, without loss of generality, that $ctrl(s_i) \ne s_{c,I}$ for $i > 0$. The final state $s_k$ holds the output batch $out(s_k) = o$ with $o \in O^b$ that is produced for the input batch $inp(s_0) = in$. Given a sequence $\mathbf{s}$, we write $initsym(\mathbf{s})$ and $final(\mathbf{s})$ to denote the subsequence of $\mathbf{s}$ containing all initial and final states that occur in $\mathbf{s}$, respectively.

Given a sequence of input batches, an HA generates a sequence of output batches based on concatenating executions for each input batch.

Definition 2. *Let* $\mathbf{in}$ *be a sequence of inputs with* $n = |\mathbf{in}|$*, and let* $s_0 \in S_I$*. Then,* $StateSeq(\mathbf{in}, s_0)$ *denotes the sequence of* successor states *of* $s_0$ *that result from executing* $\mathbf{in}$*, which is defined as follows.*

• *Let* $s'_0$ *be the result of replacing* $inp(s_0)$ *with* $in_1$ *in* $s_0$*. Let* $\mathbf{s}' = s'_0 \cdot T(s'_0)$*.*

• *If* $n = 1$*, then* $StateSeq(\mathbf{in}, s_0) = \mathbf{s}'$*.*

• *Otherwise,*

	- *let* $s_f = final(\mathbf{s}')$ *(which is unique),*
	- *let* $s_i = (s_{c,I}, mem(s_f))$*,*
	- *let* $\mathbf{s}'' = StateSeq(\langle in_2, \dots, in_n \rangle, s_i)$*.*

  *Then* $StateSeq(\mathbf{in}, s_0) = \mathbf{s}' \cdot \mathbf{s}''$*.*

For an HA Acc, we write $StateSeq(Acc, \mathbf{in}, s_0)$ to explicitly refer to the successor states of $s_0$ generated by Acc. If Acc is clear from the context, we omit it.
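The run semantics described above can be sketched in C. This is only an illustration under stated assumptions: the toy accelerator (XOR with a key), the batch size `B = 4`, and the names `state_t`, `step`, `run_batch`, and `run_sequence` are hypothetical and do not come from the paper. Each run overwrites the input part of the memory, resets the control state to the initial one, and applies the transition function until the final control state is reached; a sequence of batches chains runs by carrying the memory of each final state into the next initial state.

```c
#include <assert.h>
#include <string.h>

#define B 4  /* batch size b (toy value, an assumption) */

typedef enum { C_INIT, C_RUN, C_FINAL } ctrl_t;

typedef struct {
    ctrl_t ctrl;   /* control state s_c */
    int in[B];     /* input part s_in of the memory state */
    int out[B];    /* output part s_out */
    int key;       /* relevant state s_r */
    int idx;       /* non-relevant state s_n (loop counter) */
} state_t;

/* Transition function T of the toy HA: one step per call. */
static state_t step(state_t s) {
    assert(s.ctrl != C_FINAL);
    if (s.ctrl == C_INIT) { s.idx = 0; s.ctrl = C_RUN; return s; }
    s.out[s.idx] = s.in[s.idx] ^ s.key;
    s.idx++;
    if (s.idx == B) s.ctrl = C_FINAL;
    return s;
}

/* One run: replace inp(s0) by the batch, start in the initial control
   state, and iterate T until a final state is reached. */
static state_t run_batch(state_t s0, const int batch[B], int *steps) {
    int k = 0;
    memcpy(s0.in, batch, sizeof s0.in);
    s0.ctrl = C_INIT;
    while (s0.ctrl != C_FINAL) { s0 = step(s0); k++; }
    if (steps) *steps = k;
    return s0;
}

/* Chained runs: the memory of each final state seeds the next run. */
static state_t run_sequence(state_t s0, int n, int batches[][B], int outs[][B]) {
    state_t s = s0;
    for (int i = 0; i < n; i++) {
        s = run_batch(s, batches[i], NULL);
        memcpy(outs[i], s.out, sizeof s.out);
    }
    return s;
}
```

In this toy model the key is never written, so the relevant state is identical at the start of every run, which matches the informal notion of a non-interfering HA.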

Definition 3. *A state* $s \in S$ *is* reachable *if* $s \in S_{CI}$ *or if there exists a* concrete initial state $s_0 \in S_{CI}$ *and a sequence* $\mathbf{in}$ *of input batches such that* $s \in StateSeq(\mathbf{in}, s_0)$*. A relevant state* $s_r$ *is reachable if* $s_r = rel(s)$ *for some reachable state* $s$*.*

Note that the initial states $S_I$ are not necessarily all reachable.

Next, we define an abstract specification for an HA function. Note that we use this to define correctness, but one of the features of A-QED is that the specification is not needed for the main verification technique.

Definition 4 (Abstract Specification). *For an HA* Acc*, let* $Spec : I \times S_R \to O$ *be an* abstract specification function*.*

Definition 4 states that the value of an output computed by an HA is completely determined by the corresponding input and the relevant part of the memory state when the HA was started. Note that the inclusion of the relevant memory state makes the definition general enough to model interfering HAs. To model non-interfering HAs, we can either make the output dependent on only the input batch, or require that the relevant state does not change in state transitions.

Based on the abstract specification, we define the *functional correctness* of an HA in terms of the output batches that are produced for given input batches as follows.

Definition 5 (Functional Correctness). *An HA* Acc *is* functionally correct *with respect to an abstract specification* Spec *if, for all concrete initial states* $s_0 \in S_{CI}$ *and all sequences* $\mathbf{in}$ *of input batches, if*


*then* $\forall j \in [1 \dots b].\ o_n(j) = Spec(in_n(j), rel(s_{I,n}))$*.*

A bug is simply a failure of functional correctness.

As mentioned above, even without a formal specification, we can apply the core technique of A-QED. To do so, we leverage the concept of *functional consistency*, the notion that under modest assumptions, two identical inputs will always produce the same outputs.

Definition 6 (Functional Consistency (*FC*)). *An HA* Acc *is* functionally consistent *if, for all concrete initial states* $s_0 \in S_{CI}$ *and for all sequences* $\mathbf{in}$ *of input batches, if*


*then* $\forall i \in [1, n],\ j, j' \in [1, b].\ in_i(j) = in_n(j') \land rel(s_{I,i}) = rel(s_{I,n}) \to o_i(j) = o_n(j')$*.*

Definition 6 illustrates the need for the *relevant* designation for memory states. It essentially says that two inputs, even if started at different times and in different batch positions, should produce the same output, as long as the relevant part of the memory is the same when the two inputs are sent in. The following lemma is straightforward (see the online appendix [53] for proofs of this and other results).

Lemma 1 (Soundness of FC). *If an HA is functionally correct, then it is functionally consistent.*

Checking FC requires running BMC over multiple iterations of the HA and may be computationally prohibitive for large designs or for large values of n. Often, it is possible to verify a stronger property, which only requires checking consistency across two runs of the HA.

Definition 7 (Strong FC). *An HA* Acc *is* strongly functionally consistent *if, for all reachable initial states* $s_0, s'_0$ *and input batches* $in, in'$*, if*

• $\mathbf{s} = StateSeq(\langle in \rangle, s_0)$, $\mathbf{s}' = StateSeq(\langle in' \rangle, s'_0)$,

• $\mathbf{s}_F = final(\mathbf{s}) = \langle s_F \rangle$, $\mathbf{s}'_F = final(\mathbf{s}') = \langle s'_F \rangle$, $\mathbf{o} = out(\mathbf{s}_F) = \langle o \rangle$, $\mathbf{o}' = out(\mathbf{s}'_F) = \langle o' \rangle$,

*then*

$$\forall j, j' \in [1, b].\ in(j) = in'(j') \land rel(s_0) = rel(s'_0) \to o(j) = o'(j').$$

The main difference between FC and strong FC is that the initial states $s_0$ and $s'_0$ can be any reachable states. In contrast, the initial state $s_0 \in S_{CI}$ in the definition of FC is a concrete one. It is easy to see that strong FC implies FC, but the reverse is not true in general. This is because it may not be possible for two reachable initial states $s_0$ and $s'_0$ chosen in a strong FC check to both appear in a single sequence of states resulting from executing a sequence of input batches starting in a concrete initial state. Similar to previous work on A-QED for non-batch-mode HAs [8], FC checking relies on sequences of input batches to reach all reachable states from a concrete initial state. For strong FC checking, on the other hand, two individual input batches are sufficient because the two initial states $s_0$ and $s'_0$ can be arbitrarily chosen from the reachable states. Like FC, strong FC is a sound approach.
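As a rough illustration of the two-run shape of a strong FC check, the following C sketch compares the outputs of two independent runs of a toy accelerator. Everything here is an assumption made for the example: the accelerator (XOR with a key that plays the role of the relevant state), the injected bug, and the names `acc`, `acc_buggy`, and `strong_fc_pair`. A real check would range over all inputs symbolically with a model checker, whereas this sketch evaluates a single concrete pair of runs.

```c
#include <assert.h>
#include <stdbool.h>

#define B 4  /* toy batch size (assumption) */

typedef void (*acc_fn)(const int in[B], int key, int out[B]);

/* Hypothetical accelerator: XORs each batch element with the key. */
static void acc(const int in[B], int key, int out[B]) {
    for (int j = 0; j < B; j++) out[j] = in[j] ^ key;
}

/* Illustrative bug: the last batch position ignores the key. */
static void acc_buggy(const int in[B], int key, int out[B]) {
    for (int j = 0; j < B; j++)
        out[j] = (j == B - 1) ? in[j] : (in[j] ^ key);
}

/* One instance of the strong FC implication: two runs, each with its own
   input batch and relevant state. Equal elements under equal relevant
   state must yield equal outputs, regardless of batch position. */
static bool strong_fc_pair(acc_fn f, const int in1[B], int key1,
                           const int in2[B], int key2) {
    int o1[B], o2[B];
    f(in1, key1, o1);
    f(in2, key2, o2);
    if (key1 != key2) return true;  /* premise rel(s0) = rel(s0') fails */
    for (int j = 0; j < B; j++)
        for (int jp = 0; jp < B; jp++)
            if (in1[j] == in2[jp] && o1[j] != o2[jp]) return false;
    return true;
}
```

For example, with batches `{7, 1, 2, 3}` and `{0, 0, 0, 7}` and key 5, the buggy variant is flagged because the value 7 is processed at position 0 in one run and at position 3 in the other, and the two positions disagree.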

Lemma 2 (Soundness of Strong FC). *If an HA is functionally correct, then it is strongly functionally consistent.*

A challenge with using strong FC is that it requires starting with reachable initial states. However, we found that in practice (cf. Section V), it is seldom necessary to add any constraints on the initial states. This may seem surprising given the well-known problem of spurious counterexamples that arises when using formal verification to prove functional correctness without properly constraining initial states. There are at least two reasons for this. First, many HAs have less dependence on internal state (none for non-interfering HAs) than other kinds of designs. Second, and more importantly, FC is a much more forgiving property than design-specific correctness. Many designs are functionally consistent, even when run from unreachable states. In fact, we believe that this is a natural outcome of good design and that designing for FC is a sweet spot in the tradeoff between design for verification and other design goals. If designers take care to ensure FC, even from unreachable states, then strong FC is both sound and easy to formulate.

Even simpler versions of the checks above can be obtained by making them *intra-batch* checks. An HA is *intra-batch functionally consistent* if it is functionally consistent when $i = n = 1$. That is, intra-batch FC checks are based on sending a single input batch to the HA. Consequently, it is not necessary to identify and compare the relevant parts of the initial states (cf. Definition 6) as there is precisely one initial state being used. Similarly, an HA is *intra-batch strongly functionally consistent* if it is strongly functionally consistent when $s_0 = s'_0$ and $in = in'$. Again, only one input batch is sent to the HA and the relevant parts of the initial states are thus always equal. As we will show in Section V, intra-batch checks can be a very effective approach for cheaply finding bugs. Intra-batch checks are applicable only to batch-mode HAs; i.e., they are not applicable in the context of A-QED targeted at HAs processing sequences of single inputs [8] rather than input batches.
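An intra-batch check needs only one run and no specification, which a small C sketch can make concrete. The toy operation (`2*x + 1`), the off-by-one bug, and the names `acc_ok`, `acc_off_by_one`, and `intra_batch_fc` are all hypothetical choices for illustration. The check looks for two positions of the same batch that hold equal inputs but different outputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define B 4  /* toy batch size (assumption) */

typedef void (*acc_fn)(const int in[B], int out[B]);

/* Hypothetical batch operation applied to every element. */
static void acc_ok(const int in[B], int out[B]) {
    for (int j = 0; j < B; j++) out[j] = 2 * in[j] + 1;
}

/* Illustrative bug: loop bound is off by one, so the last output
   keeps whatever value the buffer held before the run. */
static void acc_off_by_one(const int in[B], int out[B]) {
    for (int j = 0; j < B - 1; j++) out[j] = 2 * in[j] + 1;
}

/* Intra-batch FC on one concrete batch: equal inputs at two positions
   of the same batch must produce equal outputs. */
static bool intra_batch_fc(acc_fn f, const int in[B]) {
    int out[B];
    memset(out, 0, sizeof out);  /* stale buffer contents, here all zero */
    f(in, out);
    for (int j = 0; j < B; j++)
        for (int jp = j + 1; jp < B; jp++)
            if (in[j] == in[jp] && out[j] != out[jp]) return false;
    return true;
}
```

Note that the check is vacuous on a batch with pairwise distinct elements; a formal tool sidesteps this by choosing the batch symbolically, so that duplicated elements are always among the cases considered.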

While functional consistency alone can find many bugs, it becomes a complete technique (i.e., it finds all bugs) by combining it with *single-action checks*.

Definition 8 (Single-Action Correctness (*SAC*)). *An HA* Acc *is* single-action correct (*SAC*) *with respect to an abstract specification* Spec *if, for every batch element* $(a, d)$ *and for every reachable relevant state* $s_r$*, there exists some reachable initial state* $s$*, such that* $inp(s)(j) = (a, d)$ *for some* $j$*,* $rel(s) = s_r$*, and* $out(final(T(s)))(j) = Spec((a, d), s_r)$*.*

Essentially, SAC requires that for each action $a$, data $d$, and reachable relevant state $s_r$, we have checked that the result is computed correctly when starting from some reachable initial state $s$ whose relevant state matches $s_r$. For every batch element $(a, d)$ and $s_r$, it is sufficient to run a single check where we can choose $(a, d)$ to be at any arbitrary position $j$ in the batch $inp(s)$. Checking SAC *does* require using the specification explicitly, but these kinds of checks typically already exist in unit or regression tests. SAC may even be possible to verify using simulation. As we show in Section V, many bugs can be discovered without checking SAC at all.

When formalizing single-action checks, we again advocate using an over-approximation for reachability and encourage the design of HAs with simple over-approximations for the set of reachable relevant states. For the encryption example we gave above, the set of reachable relevant states is just the set of valid keys, which should be easy to specify.

In earlier work, using a slightly different HA model, we showed that SAC and functional consistency ensure correctness only when the HA is *strongly connected (SC)*, that is, when there exists a sequence of state transitions from every reachable state to every other reachable state. The same is true here.

Lemma 3 (Completeness of SAC + FC + SC). *If an HA is strongly connected and single-action correct and has a bug, then it is not functionally consistent.*

However, strong functional consistency leads to an even stronger result.

Lemma 4 (Completeness of SAC + Strong FC). *If an HA is single-action correct and has a bug, then it is not strongly functionally consistent.*

Finally, to address timeliness of results in addition to correctness, we define a notion of *responsiveness* for our model.

Definition 9 (Responsiveness). *An HA is* responsive with respect to bound $n$ *if, for all concrete initial states* $s_0 \in S_{CI}$*, sequences* $\mathbf{in}$ *of input batches, and input batches* $in$*, if*

• $\mathbf{s} = StateSeq(\mathbf{in}, s_0) = \langle s_0, \dots, s_m \rangle$ *and*

• $\mathbf{s}' = StateSeq(\mathbf{in} \cdot in, s_0) = \langle s_0, \dots, s_{m+l} \rangle$*,*

*then* $l \le n$*.*
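Responsiveness bounds the number of additional transitions needed to process one more input batch. The following C sketch counts transitions up to a bound; the toy state machine, the stalling bug, and the names `st`, `step_ok`, `step_stall`, and `steps_within` are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define B 4  /* toy batch size (assumption) */

typedef struct {
    int idx;     /* next batch position to process */
    bool done;   /* true once the final control state is reached */
    int in[B];   /* current input batch */
} st;

typedef st (*step_fn)(st);

/* Hypothetical HA: consumes one batch element per transition. */
static st step_ok(st s) {
    s.idx++;
    if (s.idx == B) s.done = true;
    return s;
}

/* Illustrative bug: the HA stalls forever on a zero-valued element. */
static st step_stall(st s) {
    if (s.in[s.idx] != 0) s.idx++;
    if (s.idx == B) s.done = true;
    return s;
}

/* Returns the number l of transitions until a final state is reached,
   or -1 if no final state is reached within n transitions. */
static int steps_within(step_fn T, st s, int n) {
    for (int l = 0; l <= n; l++) {
        if (s.done) return l;
        s = T(s);
    }
    return -1;
}
```

A return value of -1 corresponds to a violation of the bound, i.e., a run for which $l \le n$ does not hold.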

#### *A. Decomposition for FC Checking*

We now show how FC of a decomposed design can be derived from FC of its parts. We first give conditions under which two HAs can be composed.

Definition 10 (Functionally Composable). $Acc_1$ *and* $Acc_2$ *are* functionally composable *if: (i)* $b_1 = b_2$*; (ii)* $O_1 = A_2 \times D_2$*; (iii)* $S_{C,1} \cap S_{C,2} = \emptyset$*; (iv)* $S_{R,1} = S_{R,2}$*; and (v)* $S_{N,1} = S_{Out,2} \times S'_N$ *and* $S_{N,2} = S_{In,1} \times S'_N$ *for some* $S'_N$*.*

Note in particular that composability requires that the outputs of Acc<sub>1</sub> match the inputs of Acc<sub>2</sub>. We also require that the two HAs have isomorphic memory states, which is ensured by including $S_{Out,2}$ in the non-relevant states of Acc<sub>1</sub> and $S_{In,1}$ in the non-relevant states of Acc<sub>2</sub>. In order to map a memory state of Acc<sub>1</sub> to the corresponding memory state in Acc<sub>2</sub>, we define a mapping function $\alpha : S_{M,1} \to S_{M,2}$ as follows: $\alpha(s_m) = (out(s_m), nrel(s_m)(1), rel(s_m), (inp(s_m), nrel(s_m)(2)))$. We next define functional composition.

Definition 11 (Functional Composition, Sub-Accelerators). *Given functionally composable HAs* $Acc_1$ *and* $Acc_2$*, we define the* functional composition $Acc_0 = Acc_2 \circ Acc_1$ *(*$Acc_1$ *and* $Acc_2$ *are called* sub-accelerators *of* $Acc_0$*) as follows:* $b_0 = b_1$*,* $A_0 = A_1$*,* $D_0 = D_1$*,* $O_0 = O_2$*,* $S_{C,0} = S_{C,1} \cup S_{C,2}$*,* $S_{M,0} = S_{M,1}$*,* $s_{c,I,0} = s_{c,I,1}$*,* $s_{c,F,0} = s_{c,F,2}$*,* $S_{m,I,0} = S_{m,I,1}$*. The transition function is defined as follows.* $T_0(s_c, s_m) =$


Definition 11 essentially states that an execution of $Acc_0 = Acc_2 \circ Acc_1$ is obtained by first running Acc<sub>1</sub> to completion, then passing the outputs of Acc<sub>1</sub> to the inputs of Acc<sub>2</sub>, and then running Acc<sub>2</sub> to completion. As a variant of Definition 11, it is also possible to define functional composition where the sub-accelerators operate in parallel. This way, the sub-accelerators process non-overlapping parts of a given input batch and produce the respective non-overlapping parts of the output batch.
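The sequential composition and the memory mapping between the two sub-accelerators can be sketched in C as follows. This is a simplified illustration, not the paper's construction: control states and single transitions are abstracted away (each sub-accelerator is modeled as a function that runs to completion), and the memory layout `mem_t`, the two toy operations, and the function names are assumptions.

```c
#include <assert.h>
#include <string.h>

#define B 4  /* toy batch size (assumption) */

/* Memory state (s_in, s_out, s_r, s_n), with s_n split into two parts. */
typedef struct {
    int in[B];    /* s_in */
    int out[B];   /* s_out */
    int rel;      /* s_r, e.g., a key */
    int slot[B];  /* first part of s_n: holds the other HA's batch */
    int rest;     /* second part of s_n */
} mem_t;

/* Mapping from a memory state of the first sub-accelerator to one of the
   second: (out, nrel(1), rel, (inp, nrel(2))). */
static mem_t alpha(mem_t m) {
    mem_t r;
    memcpy(r.in, m.out, sizeof r.in);     /* outputs become inputs */
    memcpy(r.out, m.slot, sizeof r.out);
    r.rel = m.rel;                        /* relevant state is shared */
    memcpy(r.slot, m.in, sizeof r.slot);  /* old inputs kept as non-relevant */
    r.rest = m.rest;
    return r;
}

/* Hypothetical first sub-accelerator: XOR with the relevant state. */
static mem_t acc1(mem_t m) {
    for (int j = 0; j < B; j++) m.out[j] = m.in[j] ^ m.rel;
    return m;
}

/* Hypothetical second sub-accelerator: increment each element. */
static mem_t acc2(mem_t m) {
    for (int j = 0; j < B; j++) m.out[j] = m.in[j] + 1;
    return m;
}

/* Composition: run acc1 to completion, remap memory, run acc2. */
static mem_t acc0(mem_t m) {
    return acc2(alpha(acc1(m)));
}
```

Because neither toy sub-accelerator writes `rel`, the relevant state after a run equals the relevant state before it, which is the extra condition that strong FCD adds on top of output equality.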

We now introduce a compositional version of FC.

Definition 12 (Strong FC for Decomposition (*FCD*)). *An HA* Acc *is* strongly functionally consistent for decomposition (strongly FCD) *if it is strongly functionally consistent and, in addition to* $o(j) = o'(j')$*, the property* $rel(s_F) = rel(s'_F)$ *holds in the conclusion of the implication in Definition 7.*

Note that strong FCD is stronger than strong FC. In order to stitch together results on sub-accelerators, we need to establish that not only the output but also the relevant memory state is the same after processing identical inputs. The following is clear from the definition.

Corollary 1. *If an HA* Acc *is strongly FCD, then* Acc *is strongly FC.*

We now show that composition preserves strong FCD and then state our main result.

Lemma 5 (Functional Composition and Strong FCD). *Let* $Acc_0 = Acc_2 \circ Acc_1$*. If both* $Acc_1$ *and* $Acc_2$ *are strongly FCD, then* $Acc_0$ *is strongly FCD.*

Theorem 1 (Completeness of A-QED<sup>2</sup>). *Let* $Acc_0$, $Acc_1$*, and* $Acc_2$ *be HAs such that* $Acc_0 = Acc_2 \circ Acc_1$ *and* $Acc_0$ *is single-action correct. If* $Acc_1$ *and* $Acc_2$ *are strongly FCD, then* $Acc_0$ *is functionally correct.*

Theorem 1 states that A-QED<sup>2</sup> is complete. That is, by contraposition, if an HA Acc<sub>0</sub> has a bug, i.e., it is not functionally correct, then either Acc<sub>1</sub> or Acc<sub>2</sub> is not strongly FCD, and thus the bug can be detected by A-QED<sup>2</sup>.
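Spelled out, with $\mathit{SAC}$, $\mathit{sFCD}$, and $\mathit{Correct}$ as shorthand predicates introduced only for this illustration, the statement of Theorem 1 and the contrapositive form used for bug detection read:

$$\mathit{SAC}(Acc_0) \land \mathit{sFCD}(Acc_1) \land \mathit{sFCD}(Acc_2) \implies \mathit{Correct}(Acc_0)$$

$$\mathit{SAC}(Acc_0) \land \neg\mathit{Correct}(Acc_0) \implies \neg\mathit{sFCD}(Acc_1) \lor \neg\mathit{sFCD}(Acc_2)$$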

Note that there is no corresponding soundness result. This is because it is possible to decompose a functionally consistent HA into functionally inconsistent sub-accelerators. However, as shown in Section V, this appears to be rare in practice. Here again we reiterate our position on design for verification and advocate that sub-accelerators, too, should be designed with functional consistency in mind.

Functional composition can easily be generalized to more than two sub-accelerators. Moreover, it can be applied recursively to further decompose sub-accelerators. If functional decomposition based on Definition 11 is not applicable to further decompose a sub-accelerator, then such a sub-accelerator can be decomposed using existing formal decomposition approaches, though these require significant manual effort. Our approach identifies conditions under which simple, automatable decomposition of FC checking is possible.

#### IV. A-QED<sup>2</sup> FUNCTIONAL DECOMPOSITION IN PRACTICE

We now present our implementation of A-QED<sup>2</sup>, which builds on the theoretical framework of the previous section. We combine functional decomposition with checks for FC (dFC), SAC (dSAC), and responsiveness (dRB).

#### *A. Decomposition for FC: dFC*

dFC takes as input a non-interfering LCA design Acc (satisfying Definitions 1 and 2) together with designer-provided annotations (explained in this section). dFC decomposes Acc into sub-accelerators (following Definition 11). FC checks are run on the sub-accelerators and any counterexamples are reported. Note that the way in which Acc is actually decomposed into sub-accelerators has no influence on the completeness of A-QED<sup>2</sup> (Theorem 1). That said, FC checks may scale better for certain decompositions. While failing FC checks expose consistency issues at the sub-accelerator level, it is possible that they do not cause incorrect behaviors at the full Acc level. However, we did not observe any instances of this in our experiments.

Our dFC implementation relies on identifying *batch operations* in a given Acc. A batch operation operates on a vector of inputs, applying some action to each input in order to produce a vector of outputs. The input to a batch operation could be an intermediate output batch of another sub-accelerator or an input batch to Acc itself. A batch operation produces either an intermediate output batch which is subsequently processed by another sub-accelerator or an output batch of Acc itself.

We assume that Acc is expressed in a high-level language, specifically as a C/C++ program<sup>1</sup> that implements sequential computation of Acc outputs from Acc inputs.<sup>2</sup> Batch operations in the C/C++ program are identified by finding contiguous C/C++ statements called *functional blocks* that implement those batch operations. Each functional block represents a sub-accelerator.

We have developed a set of annotations by which the designer can help identify these functional blocks. Examples of such annotations are given in Listing 2 (extends Listing 1). It has two functional blocks corresponding to batch operations: lines 15-17 and 32-33.

Annotations are defined by particular keywords that are prefixed by "%" (and denoted in blue) in Listing 2. These annotations describe the compute and memory access patterns of the functional block as it transforms an input batch into an output batch. In practice, hardware designers already use similar annotations frequently, e.g., to express parallelization opportunities for HLS to generate efficient hardware. As a result, we expect manageable effort in creating such annotations to support dFC. The HLS research community is actively developing new techniques to automatically explore the HA design space and derive optimal design points together with appropriate parallelization and pipelining [54]–[56]. With tight integration of A-QED<sup>2</sup> with HLS, we expect that it will be possible to generate dFC annotations with low effort.

Listing 2: C/C++ Annotation Example (AES Encryption)

```
1  #define BS ((1) << 12) // BUF_SIZE
2  #define UF 2 // UNROLL_FACTOR
3  #define US BS / UF // UNROLL_SIZE
4
5  void fun(int data[BS], int buf[UF][US], int key[2]) {
6    int j, k;
7
8    %IN_SIZE 16 // variables per input batch element
9    %IN_BATCH_SIZE BS / IN_SIZE // input batch size
10   %BATCH_MEM_IN data // input batch source
11   %IN_ALLOC_RULE in(x) addr range =
12     [i*BS + x*IN_SIZE :
13      i*BS + (x+1)*IN_SIZE] // BATCH_MEM_IN layout
14   // ===ACC1 START===
15   for (j = 0; j < UF; j++)
16     for (k = 0; k < BS / UF; k++)
17       buf[j][k] = *(data + i*BS + j*US + k) ^ key[0];
18   // ===ACC1 END===
19   %OUT_SIZE 16 // variables per output batch element
20   %OUT_BATCH_SIZE BS / OUT_SIZE // output batch size
21   %BATCH_MEM_OUT buf // output batch source
22   %OUT_ALLOC_RULE out(x) addr range =
23     [x/US][(x%US)*OUT_SIZE :
24      ((x+1)%US)*OUT_SIZE] // BATCH_MEM_OUT layout
25
26   %IN_SIZE 16
27   %IN_BATCH_SIZE BS / IN_SIZE
28   %BATCH_MEM_IN buf
29   %IN_ALLOC_RULE in(x) addr range =
30     [(x%US)*IN_SIZE : ((x+1)%US)*IN_SIZE][x/US]
31   // ===ACC2 START===
32   for (j = 0; j < UF; j++) {
33     aes256_encrypt(local_key[j], buf[j]); }
34   // ===ACC2 END===
35   %OUT_SIZE 16
36   %OUT_BATCH_SIZE BS / OUT_SIZE
37   %BATCH_MEM_OUT buf
38   %OUT_ALLOC_RULE out(x) addr range =
39     [(x%US)*OUT_SIZE : ((x+1)%US)*OUT_SIZE][x/US]
40 }
```

<sup>1</sup>HAs expressed in Verilog or SystemC can be converted into C/C++, and then our dFC implementation can be applied. We do this in Sec. V.

<sup>2</sup>Existing HLS tools (e.g., Xilinx Vivado HLS, Mentor Catapult HLS) can then optimize Acc, incorporate appropriate pipelining and parallelism, and produce Verilog for subsequent logic synthesis and physical design steps. Such HLS-based HA design flows are becoming increasingly common.
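To make the layout annotations more concrete, the following C sketch evaluates the input allocation rule of the first functional block, i.e., the address range of batch element `x` in `data`. It rests on assumptions: the range is read as half-open, `i` is treated as a batch index supplied by the caller (the listing does not declare it), and the names `addr_range` and `in_range` are hypothetical.

```c
#include <assert.h>

#define BS ((1) << 12)                 /* BUF_SIZE, as in the listing */
#define IN_SIZE 16                     /* variables per input batch element */
#define IN_BATCH_SIZE (BS / IN_SIZE)   /* number of elements per batch */

typedef struct { int lo; int hi; } addr_range;  /* assumed half-open [lo, hi) */

/* Address range of input batch element x within batch i. */
static addr_range in_range(int i, int x) {
    assert(x >= 0 && x < IN_BATCH_SIZE);
    addr_range r = { i * BS + x * IN_SIZE, i * BS + (x + 1) * IN_SIZE };
    return r;
}
```

With these constants, `IN_BATCH_SIZE` evaluates to 4096 / 16 = 256, and consecutive elements occupy consecutive, non-overlapping 16-variable ranges.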
From the annotations, we create sub-accelerators. For example, the annotations in Listing 2 generate two sub-accelerators: Acc<sub>1</sub>, corresponding to the functional block in Lines 15-17 with annotations in Lines 8-13 and 19-24, and Acc<sub>2</sub>, corresponding to the functional block in Lines 32-33 with annotations in Lines 26-30 and 35-39. For each sub-accelerator, we create an *A-QED*<sup>2</sup> *module* for FC checking.<sup>3</sup> It generates symbolic inputs for the sub-accelerator and symbolically executes the corresponding functional block in order to produce symbolic expressions for the outputs. For strong FC checks (Definitions 6 and 7), the relevant states (Definition 1) must additionally be identified and explicitly constrained to be consistent across sub-accelerator calls processing two input batches. Identifying the relevant states is not necessary for intra-batch FC checks (discussed in the context of Lemma 2). For example, in sub-accelerator Acc<sub>1</sub> in Listing 2, *key[0]* is a relevant state element (distinct from the batch input *data*). Between two calls of Acc<sub>1</sub> during a strong FC check, *key[0]* must be consistent. In our implementation, we ignore reachability and allow all checks to start from fully symbolic initial states. This does not lead to spurious counterexamples in our experiments.

#### *B. Decomposition for RB: dRB*

The sub-accelerators for the RB checks of A-QED<sup>2</sup> (Definition 9) can be (and often are) different from those for FC because RB involves a much simpler check: *some* output is produced within the response bound n. We expect n to be provided by the designer for the top-level accelerator. We then use the same bound n for each sub-accelerator. The rationale is that if a sub-accelerator fails an RB check, then the full accelerator would also fail the same RB check.

For dRB, we generate a static single assignment (SSA) representation of the design. We then apply a *sliding window algorithm* to dynamically generate sub-accelerators. Lines of code in the SSA that fall within a certain *window* W form the sub-accelerator. Due to SSA form, the inputs of this sub-accelerator are variables that are never updated or assigned in W, while the outputs are the variables which update variables outside W. The current size of W is given by the number of LOCs that fit in W, and it changes dynamically during a run of the algorithm to incorporate the largest sub-accelerator that will fit the BMC tool. Once the sub-accelerator is verified, W slides by δ LOCs (δ is a parameter) and adjusts its boundary to get the next largest sub-accelerator that can be verified. We synthesize that sub-accelerator using HLS (since some responsiveness bugs only manifest after HLS) and then run RB checks using BMC. The initial states of each generated sub-accelerator are left unconstrained (i.e., fully symbolic) in order to analyze all possible behaviors. The specific size of W and its position in the SSA code change dynamically as dRB proceeds. dRB terminates when W reaches the end of the SSA code or if at any time an RB check fails.

<sup>3</sup>See the online appendix [53] for details.
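The control flow of such a sliding window can be sketched in C. This is a loose illustration under several assumptions, not the dRB implementation: the window is a half-open range of SSA line indices, the capacity of the BMC tool and the RB check are abstracted as caller-supplied predicates, a one-line window is assumed to always fit, and the slide is clamped so that no line is skipped. All names (`window_pred`, `sliding_window_rb`, and the toy predicates) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Predicate over a window of SSA lines [start, end). */
typedef bool (*window_pred)(int start, int end);

/* Toy stand-in for tool capacity: at most 3 lines fit. */
static bool fits_up_to_3(int start, int end) { return end - start <= 3; }

/* Toy stand-ins for the RB check on a window. */
static bool rb_always_ok(int start, int end) { (void)start; (void)end; return true; }
static bool rb_fails_on_line_4(int start, int end) { return !(start <= 4 && 4 < end); }

/* Returns the number of windows checked, or -1 as soon as an RB check fails. */
static int sliding_window_rb(int n_lines, int delta,
                             window_pred fits, window_pred rb_ok) {
    assert(n_lines > 0 && delta > 0);
    int start = 0, checked = 0;
    while (start < n_lines) {
        int end = start + 1;                        /* smallest window */
        while (end < n_lines && fits(start, end + 1))
            end++;                                  /* grow W while it fits */
        if (!rb_ok(start, end)) return -1;          /* RB check failed */
        checked++;
        if (end == n_lines) break;                  /* W reached end of code */
        start = (start + delta < end) ? start + delta : end;  /* slide by delta */
    }
    return checked;
}
```

For 7 lines, a slide of 2, and the 3-line capacity, the sketch checks the windows [0,3), [2,5), and [4,7).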

#### *C. Decomposition for SAC: dSAC*

As mentioned above, and as will be shown in the next section, many bugs can be detected using only dFC and dRB. The advantage of this is that both of these checks can be run without any functional specification. dSAC completes the story, but at the cost of requiring specifications. We use standard functional decomposition techniques (essentially, writing preconditions, invariants, and postconditions) to decompose SAC checks. One feature of dSAC is that only a single input in a batch needs to be checked; all other inputs in the batch can be set to constants (we use zero in our experiments). This makes both writing the properties and checking them much simpler. The non-input part of the initial state for each check is again kept fully symbolic for simplicity. If a sub-accelerator is too big, we further decompose it using finer-grained functional blocks.
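A single-action check in the spirit of Definition 8 can be sketched in C as follows. The toy specification (XOR with a key), the faulty accelerator, and the names `spec`, `acc_ok`, `acc_wrong_key`, and `sac_check` are assumptions for illustration. One batch element is placed at a chosen position, the remaining positions are set to zero, and only the output at that position is compared against the specification.

```c
#include <assert.h>
#include <stdbool.h>

#define B 4  /* toy batch size (assumption) */

typedef void (*acc_fn)(const int in[B], int key, int out[B]);

/* Hypothetical abstract specification: Spec(d, s_r) = d XOR key. */
static int spec(int d, int key) { return d ^ key; }

static void acc_ok(const int in[B], int key, int out[B]) {
    for (int j = 0; j < B; j++) out[j] = in[j] ^ key;
}

/* Illustrative bug: uses the wrong key at every position. The outputs
   are still consistent with each other, so a consistency check alone
   cannot expose it. */
static void acc_wrong_key(const int in[B], int key, int out[B]) {
    for (int j = 0; j < B; j++) out[j] = in[j] ^ (key + 1);
}

/* Single-action check for data d at batch position j. */
static bool sac_check(acc_fn f, int d, int key, int j) {
    assert(j >= 0 && j < B);
    int in[B] = {0};   /* all other batch elements are constants (zero) */
    int out[B];
    in[j] = d;
    f(in, key, out);
    return out[j] == spec(d, key);
}
```

The faulty variant illustrates why consistency checks and single-action checks complement each other: it maps equal inputs to equal outputs everywhere, yet disagrees with the specification on a single element.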

#### V. EXPERIMENTAL RESULTS

We demonstrate the practicality and effectiveness of A-QED<sup>2</sup> for 109 (buggy) versions of several non-interfering LCAs,<sup>4</sup> including open-source industrial designs [12]. We selected these designs for the following reasons:


Many of the designs were already available in sequential C or C++. We converted Verilog and SystemC designs into sequential C. To facilitate dFC, we manually inserted annotations (like those in Listing 2). For A-QED FC, we used CBMC for all designs originally represented in sequential C or C++. For designs in Verilog and SystemC, we used Cadence JasperGold (SystemC designs converted to Verilog via HLS). For A-QED<sup>2</sup> FC and SAC checks, we used CBMC version 5.10 [66]. For A-QED and A-QED<sup>2</sup> RB checks, we used Cadence JasperGold version 2016.09p002 on Verilog designs generated by the HLS tools used by the designers. Lastly, we used Frama-C [67] to check for initialization and out-of-bounds bugs on the entire C/C++ designs. We ran all our experiments on Intel Xeon E5-2640 v3 with 128GBytes of DRAM.

Tables I, II, and III summarize our results. We present comparisons between A-QED<sup>2</sup> (dFC, dRB, dSAC) and A-QED (FC, RB, SAC). Table I also compares A-QED<sup>2</sup> intra-batch FC vs. A-QED<sup>2</sup> strong FC (cf. details in the online appendix [53]).

Observation 1: HAs from various domains (including industry) show that non-interfering LCAs are highly common.

Observation 2: The vast majority of the studied HAs are too big for existing off-the-shelf formal verification tools, for both A-QED and conventional formal property verification.

Observation 3: Table I shows that A-QED<sup>2</sup> intra-batch FC checks detected bugs inside sub-accelerators (with batch sizes > 1) very quickly: under a minute for almost all of the designs, and just over a minute for nv\_large. For most batch-mode sub-accelerators (all but two in each of grayscale64, grayscale32, mean128, and mean32, i.e., eight sub-accelerators in total), intra-batch dFC checks were easily completed using off-the-shelf formal tools. Strong FC checks incur more complexity. Hence, the formal tool timed out after 12 hours for 62 sub-accelerators when running strong FC checks, distributed across multiple designs. Empirically, we found that intra-batch FC checks detected all bugs that were detected by strong FC checks.

Observation 4: A-QED<sup>2</sup> RB and A-QED<sup>2</sup> SAC are also highly effective in detecting bugs inside sub-accelerators. For the first 11 designs (AES to gsm) in Table II, we do not expect unresponsiveness bugs (confirmed by simulations). Hence, A-QED<sup>2</sup> RB checks ran for 12 hours (for increasingly longer input sequences) without detecting unresponsiveness. For designs with RB bugs, A-QED<sup>2</sup> RB checks on sub-accelerators were able to detect those in less than 11 minutes on average. For A-QED<sup>2</sup> dSAC, we observed that a significant fraction (26 out of 46 bugs (56%)) of these bugs were also detected by A-QED<sup>2</sup> FC checks. Thus, FC alone is effective at catching a wide variety of bugs.

Observation 5: A-QED<sup>2</sup> detected all bugs that were detected by conventional (simulation-based) verification techniques. Further, all counterexamples produced from verifying sub-accelerators corresponded to real accelerator-level bugs. Compared with traditional simulation-based verification, we report a ∼5X improvement in verification effort on average, with a ∼9X improvement for the large, industrial NVDLA designs. The overhead of inserting our annotations for dFC can be small compared to what designers already insert to optimize the design. For ISmartDNN, for example, the total number of annotations is 304, which is 2.8% of the total lines of code of the design. In the code of the HLS designs we considered, pragmas amount to 11% on average. We also observe a ∼60X improvement in average verification runtime compared to conventional simulations.<sup>5</sup>

#### VI. CONCLUSION

Our theoretical and experimental results demonstrate that A-QED<sup>2</sup> is an effective and practical approach for verification of large non-interfering LCAs. A-QED<sup>2</sup> exploits A-QED principles to decompose a given HA design into sub-accelerators such that A-QED can be naturally applied to the sub-accelerators. A-QED<sup>2</sup> is especially attractive for HLS-based HA design flows. A-QED<sup>2</sup> creates several promising research directions:

<sup>4</sup>See the online appendix [53] for design details and the software artifact [65].

<sup>5</sup>The conventional verification effort for NVDLA was based on start and end commit dates in its nv\_small Github repository. The conventional verification runtimes for the NVDLA, ISmartDNN, and dnn HAs were obtained by running the available simulation tests on our platform. The remaining runtime and effort information was provided by the designers.

TABLE I: Average runtimes of FC checks for A-QED and A-QED<sup>2</sup>. For A-QED<sup>2</sup>, sub-accelerator counts are provided, including the Total count that resulted from dFC decomposition, the count with batch sizes greater than one (i.e., Parallel), the count (with batch sizes greater than one) for which FC checks were successful on 1 and 2 batches for intra-batch FC and strong FC respectively, and the count for which Bugs were detected by FC checks. For A-QED FC, experiments could not complete the FC check for a single batch in 12 hours (timeout) or exhibited out-of-memory (OOM) errors before timeout. Average runtimes result from dividing the time to detect all bugs by the number of bugs. † keypair [59], gsm [60], HLSCNN [61], FlexNLP [62], Dataflow [63], and Opticalflow [64] all time out for A-QED FC and do not contain any sub-accelerators with batch size greater than one. One OOB bug was detected in gsm and one initialization bug in keypair.

TABLE II: RB checks for A-QED and A-QED<sup>2</sup>. For A-QED<sup>2</sup>, sub-accelerator counts produced by dFC are provided, as in Table I. A-QED<sup>2</sup> RB checks are performed on all sub-accelerators regardless of batch size, so P is omitted compared to Table I. For A-QED RB, RB checks did not complete even for an input sequence length of 1 within 12 hours (timeout). Sub-accelerators for which RB checks for at least an input sequence length of 1 were completed were considered Complete. For the first 11 designs, from AES to gsm, no bugs related to unresponsiveness were detected by traditional simulation-based verification. Results are omitted for nv\_large and nv\_small; responsiveness-related bugs generally result from parallelism and pipelining, both of which were lost in our manual translation of NVDLA from Verilog to sequential C code.



TABLE III: SAC checks for A-QED<sup>2</sup> . Sub-accelerator counts produced by dSAC are provided, as in Table I. A-QED<sup>2</sup> SAC checks were performed on all sub-accelerators regardless of batch size, so P is omitted compared to Table I.


#### ACKNOWLEDGMENT

This work was supported by the DARPA POSH program (grant FA8650-18-2-7854), NSF (grant A#:1764000), and the Stanford SystemX Alliance. We thank Prof. David Brooks, Thierry Tambe and Prof. Gu-Yeon Wei from Harvard University, and Kartik Prabhu and Prof. Priyanka Raina from Stanford University for their design contributions in our experiments.

#### REFERENCES


# Sound and Automated Verification of Real-World RTL Multipliers

Mertcan Temel *Electrical and Computer Engineering University of Texas at Austin* Austin, TX, USA mert@utexas.edu

Warren A. Hunt, Jr. *Computer Science University of Texas at Austin* Austin, TX, USA hunt@cs.utexas.edu

*Abstract*: We have developed an algorithm, S-C-Rewriting, that can automatically and very efficiently verify arithmetic modules with embedded multipliers. These include ALUs, dot-product, and multiply-accumulate designs that may use Booth encoding, Wallace-trees, and various vector adders. Outputs of the target multiplier designs might be truncated, right-shifted, or a combination of both. We evaluate the performance of other state-of-the-art tools on verification problems beyond isolated multipliers and we show that our method applies to a broader range of design techniques encountered in real-world modules. Our verification software is verified using the ACL2 theorem prover, and we can soundly verify 1024x1024-bit isolated multipliers and similarly large dot-product designs in minutes. We can also generate counterexamples in case of a design bug. Our tool and benchmarks are available online.

*Index Terms*: Formal Verification, Integer Multipliers, Hardware Verification, Arithmetic Circuits, ACL2, Term-rewriting

### I. INTRODUCTION

Integer multipliers are fundamental building blocks for general-purpose (e.g., CPUs and GPUs), image, communications, and cryptographic processors. Multipliers are used to implement dot-product, division, square-root, and floating-point operations; in turn, these operations find their way into graphics, cryptography, and signal processing systems. In some cases, such as cryptographic processors, integer multipliers might be used to multiply numbers as large as 1024 bits.

Given the ubiquity of multipliers, it is crucial to have a sound verification method for designs that include multipliers. However, the formal verification process of multipliers is still a challenge, especially for the most common design approaches such as Wallace tree and Booth encoding. Decision-procedure-based tools such as BDDs and SAT solvers do not scale [1], [2]. In recent years, multiplier verification efforts have shifted towards using computer algebra methods [2]–[6] and they have yielded more promising results. However, these studies focused heavily on isolated multiplier designs, and they do not perform well (if at all) for multipliers with truncated output (e.g., a 32x32-bit multiplier with a 32-bit output). Studies that explore the verification problem of embedded multipliers (e.g., multiply-accumulate, dot-product) have been limited, and they do not support designs with Wallace tree and Booth encoding [1]. Additionally, only one computer-algebra-based tool [3] provides a system to check the correctness of the proof itself, leaving open the possibility that these tools might claim a design to be correct when the design is actually flawed.

In our previous work [7], we proposed a method to verify integer multipliers efficiently and automatically. Using the ACL2 theorem proving system, we developed a provably correct verification mechanism based on term-rewriting. This method has been shown to quickly verify a wide range of integer multiplier designs (e.g., 1024x1024-bit multipliers with simple partial products have been verified in less than 10 minutes). However, our focus concerned only untruncated isolated multiplier designs. Moreover, we did not discuss how the algorithm performs with buggy designs.

We have expanded our method and we have been able to:


Additionally, we retain the same level of proof automation and keep our tool provably correct.

In this paper, we aim to explore the verification problem of multipliers on more complex designs than explored in previous verification studies and deliver our solutions. We provide examples of complex multiplier architectures with optimizations that can be encountered in real-world designs. We discuss how existing state-of-the-art verification tools perform on such modules. Finally, we present our improved method and show that we can verify these complex designs very efficiently. For example, we can verify 64x64-bit isolated multipliers or similar designs within seconds and 1024x1024-bit isolated multipliers or similar dot-product designs in 5 minutes, no matter which design algorithm is used.

This paper is structured as follows. Sec. II summarizes the most common design algorithms for isolated and embedded multipliers. We show why it is important to develop a verification method for embedded and truncated multipliers and why it is not enough to have a verification tool only for isolated multipliers. In Sec. III, we summarize the related work from the most recent and/or prominent studies. Sec. IV recapitulates our term rewriting algorithm from our previous work and introduces some of its recently discovered limitations. Sec. V discusses our new improvements so that we can verify more designs with better efficiency and generate counterexamples for buggy modules. Sec. VI describes how our lemmas are implemented and applied. Finally, we show our experiment results in Sec. VII and compare our performance with other state-of-the-art multiplier verification tools.

#### II. MULTIPLIER ARCHITECTURES

There are various algorithms to design RTL multipliers and integrate them in other arithmetic modules such as a multiply-accumulate (MAC). The difficulty of verifying these modules depends on the design algorithm. Some algorithms yield clean, regularly structured modules, while others, including the most commonly used ones, produce complex structures. This section elaborates on the verification problem by summarizing common algorithms to design multipliers and how they are implemented in other arithmetic circuits.

#### *A. Isolated Multipliers*

An isolated multiplier is a circuit with two bit-vector inputs and one bit-vector output. The output vector represents an integer equivalent to the multiplication of the input vectors, which can be signed or unsigned integers. Isolated multipliers are often implemented in two stages: partial product generation and partial product summation.

Partial products can be generated by multiplying (i.e., logical AND) each input bit with each other as in primary school multiplication. For signed numbers, the input numbers need to be sign-extended, in which case the Baugh-Wooley [8] sign extension technique can be used to lower the implementation area. Booth encoding [9] (particularly radix-4) is a more common and efficient way to generate partial products. Booth encoding incorporates more than two input bits at a time when generating partial products. This can provide more parallelism and fewer partial products. However, Booth encoding makes a circuit's structure and logic more complex, making it more difficult to reason about the circuit.
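To make the two partial-product schemes concrete, the following Python sketch generates simple (AND-based) partial products and radix-4 Booth partial products for unsigned operands and lets one check that both sum to the product. It is a textbook-style illustration and not code from the paper; the function names are hypothetical, signed operands and Baugh-Wooley sign extension are not modelled, and real hardware encodes the negative Booth rows in two's complement rather than as negative integers.

```python
from typing import List


def bits(x: int, n: int) -> List[int]:
    """Little-endian list of the n low bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def simple_partial_products(a: int, b: int, n: int) -> List[int]:
    """One row per bit of b: row j ANDs every bit of a with b_j, weighted by 2**(i+j)."""
    a_bits, b_bits = bits(a, n), bits(b, n)
    return [
        sum((a_bits[i] & b_bits[j]) << (i + j) for i in range(n))
        for j in range(n)
    ]


def booth_radix4_digits(b: int, n: int) -> List[int]:
    """Radix-4 Booth digits, each in {-2, ..., 2}, of an unsigned n-bit b."""
    padded = bits(b, n + 3)  # zero-extend above the MSB
    digits = []
    for k in range(n // 2 + 1):
        below = padded[2 * k - 1] if k > 0 else 0
        digits.append(below + padded[2 * k] - 2 * padded[2 * k + 1])
    return digits


def booth_partial_products(a: int, b: int, n: int) -> List[int]:
    """One row per Booth digit: digit * a, weighted by 4**k."""
    return [d * a * (4 ** k) for k, d in enumerate(booth_radix4_digits(b, n))]
```

For 4-bit operands, the simple scheme produces four rows while the Booth scheme produces three, which illustrates the reduction in the number of partial products mentioned above.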

There are numerous methods to sum partial products in hardware. Unlike primary school multiplication, hardware algorithms do not sum partial products one column at a time, from right to left. Summations are performed more locally with unit adders such as half and full adders. An array multiplier is a simple example that is built with such unit adders following a shift-and-add methodology. Array multipliers have a regular structure, which makes it straightforward to verify them. However, they can have a large gate delay (i.e., propagation delay). On the other hand, Wallace-tree-like multipliers [10], such as Dadda tree [11], provide more parallelism. These summation tree algorithms sum partial products with less propagation delay and only slight changes in the implementation area. Designers can also utilize low gate-delay vector adders, such as Brent-Kung [12], Ladner-Fischer [13], and conditional sum, as a final stage adder to get the multiplication result. This can make Wallace-tree-like algorithms with complex final stage adders more preferable for hardware applications, but their irregular structures make the verification problem difficult, especially when paired with Booth encoding.

We should also note that an isolated multiplier implementation may not always return the full multiplication result. Instead, the result might be truncated, right-shifted, or a combination of both. For example, when two 32-bit numbers are multiplied, a lossless multiplier would output a 64-bit number. On the other hand, if the design only calculates the lower, say, 32 bits of the result, we say that the result is truncated. Similarly, when, say, only the upper 32 bits of the result are returned from the multiplier, we say that the result is right-shifted. If only the middle portion of the result is returned, which may happen in fixed-point arithmetic, we say that the result is right-shifted and truncated. Some designs implement rounding or saturation when a certain portion of the result is discarded when truncating and/or shifting.
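The three output forms can be stated as small Python reference functions for unsigned 32-bit operands. The function names and the 16-bit shift used for the "middle portion" are illustrative assumptions; rounding and saturation are not modelled.

```python
MASK32 = (1 << 32) - 1


def full_product(a: int, b: int) -> int:
    """Lossless 64-bit result of an unsigned 32x32-bit multiplication."""
    return (a & MASK32) * (b & MASK32)


def truncated(a: int, b: int) -> int:
    """Lower 32 bits of the product (truncated output)."""
    return full_product(a, b) & MASK32


def right_shifted(a: int, b: int) -> int:
    """Upper 32 bits of the product (right-shifted output)."""
    return full_product(a, b) >> 32


def shifted_and_truncated(a: int, b: int, shift: int = 16) -> int:
    """A middle 32-bit slice, as in fixed-point arithmetic (shift is an assumed parameter)."""
    return (full_product(a, b) >> shift) & MASK32
```

With `a = b = 0xFFFFFFFF`, the full product is `0xFFFFFFFE00000001`, so the truncated output is `0x00000001`, the right-shifted output is `0xFFFFFFFE`, and the middle slice with a 16-bit shift is `0xFFFE0000`.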

#### *B. Simple Arithmetic Modules with Embedded Multipliers*

Integer multipliers can be implemented in various arithmetic modules such as MAC, dot-product, and floating-point arithmetic units. This section summarizes how a MAC module can be implemented in hardware.

A simple MAC computes a ∗ b + c, where a, b and c are bit-vectors. When designing a MAC module, one may implement an isolated multiplier that computes a ∗ b and a vector adder that adds c to the multiplier's output. To verify such a MAC module, one can decompose the design, use different tools to verify the isolated multiplier and the final adder separately, and compose the proofs to show that the overall MAC module is correct. However, this design methodology uses two vector adders consecutively (one vector adder as part of the isolated multiplier and one for adding c). Vector adders can make up a large portion of the gate delay (and/or area) in such circuits, and this design technique can increase the gate delay considerably, making this approach a poor design choice.

Fig. 1. An efficient way to compute MAC result

Fig. 1 shows an alternative approach that uses only one vector adder. This MAC module does *not* implement a complete isolated multiplier. Instead, it uses an *incomplete* multiplier. We define incomplete multipliers as modules that multiply two bit-vectors but do not use a final stage adder to return the complete multiplication result; instead, they return the two bit-vectors generated after the Wallace-tree reduction (summing these two vectors would give the multiplication result). This output form is also referred to as *redundant* form. After the incomplete multiplication, the two bit-vector outputs are summed together with the addend (c) using another Wallace tree and a vector adder. This can be a preferable design approach as it provides better gate-delay performance. However, it removes the boundaries between multiplication and summation, which complicates the job of a verification engineer. Further complicating verification, an alternative design technique may sum c with the initial partial products with a single Wallace-tree and vector adder, which can remove the boundaries even further. In such cases, we cannot simply decompose the design and use a multiplier verification tool that works only with isolated multipliers.
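The idea of an incomplete multiplier and a MAC with a single final adder can be modelled with carry-save (3:2) compression on Python integers. This is a simplified sketch for unsigned operands: the sequential reduction order below merely stands in for a real Wallace tree, and all function names are hypothetical.

```python
from typing import Iterable, Tuple


def carry_save_add(x: int, y: int, z: int) -> Tuple[int, int]:
    """Bitwise full adders (3:2 compressor): x + y + z == s + c, no carry propagation."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def reduce_to_two(vectors: Iterable[int]) -> Tuple[int, int]:
    """Compress any number of vectors into two whose sum is unchanged."""
    vs = list(vectors)
    while len(vs) > 2:
        x, y, z = vs.pop(), vs.pop(), vs.pop()
        vs.extend(carry_save_add(x, y, z))
    while len(vs) < 2:
        vs.append(0)
    return vs[0], vs[1]


def incomplete_multiply(a: int, b: int, n: int) -> Tuple[int, int]:
    """Return the product of unsigned n-bit a and b in redundant (two-vector) form."""
    rows = [(a << j) if (b >> j) & 1 else 0 for j in range(n)]
    return reduce_to_two(rows)


def mac(a: int, b: int, c: int, n: int) -> int:
    """a * b + c using one carry-propagating addition at the very end."""
    p0, p1 = incomplete_multiply(a, b, n)
    s, carry = carry_save_add(p0, p1, c)  # fold the addend in before the final adder
    return s + carry                      # the single final-stage vector adder
```

Note that neither output of `incomplete_multiply` is meaningful on its own; only their sum equals the product, which is why a tool that expects an isolated multiplier boundary has nothing to latch onto in such designs.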

We can see similar design methodologies in other modules. For example, a dot product design may use multiple *incomplete* multiplier modules and sum all the output vector pairs together in another summation tree using a Wallace-tree and a final stage adder. This method would prevent the increase in area and gate delay by using only one final stage adder in the overall design. Similarly, a floating-point module implementing FMA (fused multiply-add) may use an incomplete integer multiplier.

#### *C. Multi-purpose Multipliers*

Some processing units may implement multipliers for various arithmetic operations with different operand sizes. For example, x86 chips have many integer multiplication instructions such as PMADDWD (multi-lane multiply and add together, in other words, dot-product), PMULHW (multi-lane multiply and store upper half of the result), and PMULLW (multi-lane multiply and store lower half). Multiplier circuits can occupy a large implementation area, and it is common for such instructions to share resources and reuse multiplier modules.

We have created an example arithmetic circuit that shows how multiplier modules can be reused for different operations. We call this arithmetic unit *integrated multipliers*; its schematic diagram is shown in Fig. 2. This design multiplexes various multipliers and adders to perform 4-point 32-bit dot-product, 1-lane 64-bit multiply-accumulate, or 4-lane 32-bit multiply-accumulate with options to return lower or upper significant halves of the result. This module also includes an accumulator register that can be used, for example, to perform an 8-point 32-bit dot-product in two clock cycles, or 12-point 32-bit dot-product in three clock cycles, and so on. The mode of operation is determined by the control signal mode.

Fig. 2. The circuit diagram of integrated multipliers, our example arithmetic unit.

This module implements four identical 32x32-bit *incomplete* multipliers whose inputs are two 32-bit numbers with an additional sign bit and whose outputs are two bit-vectors. Depending on the mode of operation, the outputs of these multipliers are summed with another summation tree, and the final result is calculated with vector adders. The datapaths for 32-bit MAC and dot-product operations are as described in the previous section (Sec. II-B). This module also supports 64-bit operands, in which case the outputs of the 32x32-bit incomplete multipliers are appropriately shifted, sign-extended, and summed to calculate the 64x64-bit multiplication result. We call such operations *merged multiplication*, where multiple smaller multipliers are used to implement a larger multiplier. The module can also add a number to the 64x64-bit multiplication result and make this a 64-bit MAC operation.
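Merged multiplication follows from splitting each operand into halves. The Python sketch below shows the unsigned case only, which is a simplifying assumption: the module described above also handles signed operands through an extra sign bit and sign extension, and it sums redundant-form outputs rather than complete products. The helper names are hypothetical.

```python
MASK32 = (1 << 32) - 1


def mul32(x: int, y: int) -> int:
    """Stand-in for one 32x32-bit multiplier (unsigned)."""
    assert 0 <= x <= MASK32 and 0 <= y <= MASK32
    return x * y


def merged_mul64(a: int, b: int) -> int:
    """Unsigned 64x64-bit product built from four 32x32-bit products.

    With a = ah * 2**32 + al and b = bh * 2**32 + bl:
        a * b = ah*bh * 2**64 + (ah*bl + al*bh) * 2**32 + al*bl
    """
    al, ah = a & MASK32, (a >> 32) & MASK32
    bl, bh = b & MASK32, (b >> 32) & MASK32
    return (
        (mul32(ah, bh) << 64)
        + ((mul32(ah, bl) + mul32(al, bh)) << 32)
        + mul32(al, bl)
    )


def merged_mac64(a: int, b: int, c: int) -> int:
    """64-bit MAC obtained by adding c to the merged product."""
    return merged_mul64(a, b) + c
```

The shifts by 32 and 64 bits correspond to the "appropriately shifted" outputs of the four smaller multipliers in the text.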

We can verify this design for each possible mode of operation. For example, we can set the mode signal to perform dot product and check if the result matches the mode's specification. Industrial designs are often much more intricate than this module; however, it is often possible to reason about one arithmetic operation at a time. Then, the verification problem becomes as complex as verifying a single arithmetic operation.

#### III. RELATED WORK

The verification problem of multipliers continues to attract a great deal of research interest, and researchers offer new techniques every year. This section covers the most recent and prominent studies that attempt to solve this problem, particularly for RTL designs with Booth encoding and Wallace-tree-like structures.

#### *A. BDDs, BMDs, SAT and SMT Solvers*

Automated and well-studied generic tools and methods such as BDDs, SAT, and SMT solvers can theoretically be used to verify multiplier designs. However, it has been shown that these methods do not scale for designs larger than 12x12-bit multipliers [1], [2]. SAT solvers may scale better when generating counterexamples for buggy designs. Some success has been achieved with BMDs but only for regularly structured multipliers [14]. On the other hand, these automated tools may be used to verify some multiplier design components, such as the final stage adder [3].

#### *B. Computer Algebra Methods*

In computer algebra-based methods, multiplier circuits are modeled with a set of polynomials. Basic logical gates of a circuit are represented in terms of algebraic expressions (e.g., ∀x, y ∈ {0, 1}, x ∨ y = x + y − xy) as well as the multiplication result (see Example 1 for a 2x2-bit unsigned multiplier specification). The algebraic representation on its own does not scale when verifying multipliers. Researchers implement various heuristics and optimizations that are specific to multiplier designs to achieve efficient and practical results. A notable optimization is identifying the logic from adder modules implemented in target multiplier designs [3], [4], [15]–[17].

Example 1. $4a_1b_1 + 2a_1b_0 + 2a_0b_1 + a_0b_0$
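The algebraic encoding can be checked exhaustively for small cases. The Python sketch below evaluates the standard polynomial encodings of OR, AND, and XOR over {0, 1} and the 2x2-bit unsigned multiplier specification, reading the last term of Example 1 as $a_0b_0$. It only illustrates the modeling step; it is not a computer-algebra verifier, and the function names are hypothetical.

```python
from itertools import product


def or_poly(x: int, y: int) -> int:
    """x OR y as a polynomial over {0, 1}."""
    return x + y - x * y


def and_poly(x: int, y: int) -> int:
    """x AND y as a polynomial over {0, 1}."""
    return x * y


def xor_poly(x: int, y: int) -> int:
    """x XOR y as a polynomial over {0, 1}."""
    return x + y - 2 * x * y


def spec_2x2(a1: int, a0: int, b1: int, b0: int) -> int:
    """Word-level specification of a 2x2-bit unsigned multiplier."""
    return 4 * a1 * b1 + 2 * a1 * b0 + 2 * a0 * b1 + a0 * b0


def encodings_are_correct() -> bool:
    """Compare every polynomial with the Boolean operator it encodes."""
    gates_ok = all(
        or_poly(x, y) == (x | y)
        and and_poly(x, y) == (x & y)
        and xor_poly(x, y) == (x ^ y)
        for x, y in product((0, 1), repeat=2)
    )
    spec_ok = all(
        spec_2x2(a1, a0, b1, b0) == (2 * a1 + a0) * (2 * b1 + b0)
        for a1, a0, b1, b0 in product((0, 1), repeat=4)
    )
    return gates_ok and spec_ok
```

Unlike the per-bit target form used later in the paper, `spec_2x2` describes all output bits with one expression, which is the property that makes bit-level manipulations such as truncation awkward for this approach.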

Computer algebra methods have made a lot of progress towards the multiplier verification problem. However, these studies have focused mainly on isolated multipliers with untruncated outputs and the same operand sizes (nxn-bit multipliers with 2n-bit outputs). This makes it more difficult to utilize them for real-world designs where truncation, shifting, and integration with other arithmetic operations are common (see Sec. II).

Ciesielski et al. [1] showed that their method could be used for other multiplier-centric arithmetic operations, such as MAC; however, they reported verifying only multiplier modules with regular structures. The benchmarks and their verification tool are not provided. We do not know of any publicly available tool that can scale and automatically verify designs such as MAC and dot-product. The underlying theory used by the computer-algebra methods may support verification of such arithmetic circuits. However, some optimizations that make these tools efficient may or may not be directly applicable to modules beyond isolated multipliers.

Verifying multipliers whose output is truncated or shifted is difficult for the computer algebra approach. Su et al. [18] discussed why computer algebra techniques are inefficient when verifying truncated arithmetic circuits. They stated that intermediate expressions, which are manageable in untruncated modules, can grow exponentially in truncated designs. They suggested a method to reconstruct a truncated multiplier into a complete multiplier by adding missing elements before verification. They did not discuss the soundness of their approach, their experiments were only on simple multipliers, and the benchmarks and the tool are not provided. Kaufmann et al. [3] suggested using modular arithmetic and defined a specification in the ring $\mathbb{Z}_{2^n}[X]$, where n is the multiplier output size. They showed that this approach works on a simple multiplier model, but our experiments with RTL designs resulted in time-out. We are not aware of any computer algebra studies that can verify truncated and/or shifted RTL multipliers.

#### *C. Industrial Methods*

Verification efforts of commercial multipliers often involve a great deal of manual work. A common method is to create a simple reference design that is structurally close (isomorphic) to the original and then repeatedly equivalence-check a litany of ever-increasingly complex designs [19]. Some engineers verify reference designs using mechanized proof systems [20]. Another common analysis method is to decompose a design into smaller parts, reason about these parts separately, and then compose these proofs into a top-level theorem [21]–[23]. Finding a workable decomposition and combining individual proofs of multiplier fragments can be a cumbersome task. Such methods help formal verification engineers verify various multiplication operations such as multiply-accumulate and dot-product; however, this usually entails extensive manual effort. Moreover, these proofs are often design-specific, and even a slight change in the design might cause a previous proof procedure to fail.

#### IV. S-C-REWRITING ALGORITHM

In our previous work [7], we introduced a verified term-rewriting algorithm that can verify a wide range of isolated multiplier designs more quickly than the other state-of-the-art tools. In this section, we summarize this term-rewriting algorithm and discuss its recently discovered limitations.

We use the ACL2 theorem prover to verify and run our multiplier verification tool. ACL2 is an interactive and automated theorem proving system, and a programming language that is used by both industry and academia [24]. For a target multiplier design, we try to prove conjectures of the form given in Listing 1. defthm is a commonly used utility by ACL2 users, and it asks the ACL2 system to check conjectures. On the left-hand side, we specify symbolic simulation of a multiplier design representation. We use the SVL semantics [25] to simulate designs, which are automatically translated from Verilog (our verification tool can be used with other simulators as well). The right-hand side has the multiplier specification; in this example, the target multiplier module returns a 128-bit number equivalent to the multiplication of two 64-bit signed numbers.

Listing 1. A correctness conjecture for a signed 64x64-bit isolated multiplier

```
(defthm multiplier_is_correct
  (implies (and (integerp a)
                (integerp b))
    (equal (simulate :inputs (a b)
                     :design <signed_64x64_mult>)
           (truncate 128
                     (* (signext 64 a)
                         (signext 64 b))))))
```
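The right-hand side of Listing 1 can be read as an executable reference model. The Python sketch below is one such reading, under the assumption that `signext 64` interprets the low 64 bits of its argument as a two's complement integer and that `truncate 128` keeps the low 128 bits; the actual ACL2 definitions in the authors' library may differ in detail.

```python
def signext(n: int, x: int) -> int:
    """Interpret the low n bits of x as an n-bit two's complement integer."""
    x &= (1 << n) - 1
    return x - (1 << n) if (x >> (n - 1)) & 1 else x


def truncate(n: int, x: int) -> int:
    """Keep the low n bits of x (also maps negative x to its n-bit encoding)."""
    return x & ((1 << n) - 1)


def spec_signed_mult_64x64(a: int, b: int) -> int:
    """Assumed meaning of the specification side of the conjecture in Listing 1."""
    return truncate(128, signext(64, a) * signext(64, b))
```

Because the conjecture quantifies over all integers `a` and `b`, the sign extension on the specification side is what restricts attention to the 64 input bits the design actually reads.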
We prove such conjectures by rewriting both sides of the equality to fixed final forms. We define two functions s (short for *sum*) and c (short for *carry*) as given in Def. 1. The target representations for the first few output bits of some modules (half, full, vector adders, and multipliers) are given in Table I. Our goal is to rewrite all such modules/operations to this form. We call this s-c representation or s-c form.

Definition 1. Functions s and c are defined as follows.

$$\begin{aligned} \forall x \in \mathbb{Z} \quad s(x) &= \mathrm{mod}_2(x), \\ \forall x \in \mathbb{Z} \quad c(x) &= \left\lfloor \frac{x}{2} \right\rfloor \end{aligned}$$
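Definition 1 translates directly into Python, whose `%` and `//` operators use floor semantics and therefore agree with the definition for negative integers as well. The sketch below also compares gate-level half and full adders against their s-c forms, namely `s(a + b)` and `c(a + b)` for a half adder and `s(a + b + cin)` and `c(a + b + cin)` for a full adder. The gate-level adders are textbook definitions chosen for illustration, not modules from the paper's benchmarks.

```python
from itertools import product
from typing import Tuple


def s(x: int) -> int:
    """Sum function of Definition 1: x mod 2."""
    return x % 2


def c(x: int) -> int:
    """Carry function of Definition 1: floor(x / 2)."""
    return x // 2


def half_adder_gates(a: int, b: int) -> Tuple[int, int]:
    """Gate-level half adder: (sum, carry)."""
    return a ^ b, a & b


def full_adder_gates(a: int, b: int, cin: int) -> Tuple[int, int]:
    """Gate-level full adder: (sum, carry)."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


def adders_match_sc_form() -> bool:
    """Check that both adders equal their s-c representations on all inputs."""
    half_ok = all(
        half_adder_gates(a, b) == (s(a + b), c(a + b))
        for a, b in product((0, 1), repeat=2)
    )
    full_ok = all(
        full_adder_gates(a, b, cin) == (s(a + b + cin), c(a + b + cin))
        for a, b, cin in product((0, 1), repeat=3)
    )
    return half_ok and full_ok
```

Working with `s` and `c` over integer sums, rather than with XOR/AND/OR networks, is what lets the later lemmas manipulate adder outputs arithmetically.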

While verifying multiplier designs, we wish not to work with the logical definition of adder modules but instead work with their s-c representations. The SVL semantics allow hierarchical reasoning such that if we previously prove that symbolic simulation of an adder module can be replaced with this s-c form, then the SVL system can use this form (as opposed to the adder's logical definition) while expanding the definition of multiplier designs. Therefore, we first prove that each distinct adder module can be represented with the s-c form. We use a term-rewriting algorithm to carry out the proofs for adder modules [7]. Since verifying adders is straightforward [3], we omit this rewrite algorithm here for brevity. After the adder proofs, we start verifying the target multiplier design. As we expand the definition of the multiplier, our program replaces each instance of its adder modules automatically with their s-c representation.

TABLE I TARGETED FINAL FORMS FOR SOME MODULES/FUNCTIONS

Using the s-c form for adders instead of their logical definitions can bring about simpler expressions representing the output bits of a multiplier. An example of such an expression is given in Example 2 for a Wallace-tree multiplier with simple partial products.

Example 2. The 4th LSB of a Wallace-tree multiplier output when its adders are represented in the s-c form:

$$\begin{array}{l} s(s(s(a_3b_0 + a_2b_1 + a_1b_2) + a_0b_3 + c(a_2b_0 + a_1b_1 + a_0b_2)) \\ \quad + c(s(a_2b_0 + a_1b_1 + a_0b_2) + c(a_1b_0 + a_0b_1))) \end{array}$$

We rewrite such terms to make them syntactically equivalent to our target final form. To do that, we define a set of lemmas of the form lhs = rhs such that terms that match lhs are replaced with rhs with appropriate term bindings. All lemmas are proved using ACL2 and we omit the proofs here.

We investigated such terms from multiplier designs and realized that we could rewrite and simplify nested calls of s with Lemma 1. Rewriting with this lemma when applicable can simplify the term from Example 2 to the form given in Example 3.

Lemma 1. $\forall x, y \in \mathbb{Z}$ $s(s(x) + y) = s(x + y)$

Example 3. Example 2 simplified with Lemma 1:

$$\begin{array}{l} s(a_3b_0 + a_2b_1 + a_1b_2 + a_0b_3 \\ \quad + c(a_2b_0 + a_1b_1 + a_0b_2) \\ \quad + c(s(a_2b_0 + a_1b_1 + a_0b_2) + c(a_1b_0 + a_0b_1))) \end{array}$$
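Lemma 1 and the step from Example 2 to Example 3 can be sanity-checked numerically. The Python sketch below tests Lemma 1 on a small range of integers and compares the two example terms on all 4-bit input combinations. This is a brute-force check for illustration only; the paper's lemmas are proved in ACL2, and the helper names here are hypothetical.

```python
from itertools import product


def s(x: int) -> int:
    return x % 2


def c(x: int) -> int:
    return x // 2


def columns(a, b):
    """Partial-product sums shared by Examples 2 and 3 (a, b are bit tuples, LSB first)."""
    A = a[3] * b[0] + a[2] * b[1] + a[1] * b[2]
    B = a[2] * b[0] + a[1] * b[1] + a[0] * b[2]
    C = a[1] * b[0] + a[0] * b[1]
    return A, B, C


def example2(a, b) -> int:
    """The nested term of Example 2."""
    A, B, C = columns(a, b)
    return s(s(s(A) + a[0] * b[3] + c(B)) + c(s(B) + c(C)))


def example3(a, b) -> int:
    """Example 2 after removing nested calls of s with Lemma 1."""
    A, B, C = columns(a, b)
    return s(A + a[0] * b[3] + c(B) + c(s(B) + c(C)))


def lemma1_holds(lo: int = -8, hi: int = 8) -> bool:
    """Check s(s(x) + y) == s(x + y) on a finite range."""
    return all(
        s(s(x) + y) == s(x + y)
        for x in range(lo, hi + 1)
        for y in range(lo, hi + 1)
    )


def examples_agree() -> bool:
    """Compare Example 2 and Example 3 on every pair of 4-bit inputs."""
    bit_vectors = list(product((0, 1), repeat=4))
    return all(example2(a, b) == example3(a, b) for a in bit_vectors for b in bit_vectors)
```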

Now, we observe more than one instance of c on the same summation level. We rewrite and simplify them by a set of lemmas. Lemmas 2-5 are applied to the term as rewrite rules, where the function d is defined as $\forall x \in \mathbb{Z}\ d(x) = \frac{x}{2}$. Then, we get the term in Example 4. This is syntactically equivalent to our target form for the 4th output bit, and we can conclude that the multiplier is correct for this output bit.

Lemma 2. $\forall x, y \in \mathbb{Z}$ $c(x) + c(y) = d(x + y - s(x) - s(y))$

Lemma 3. $\forall x, y \in \mathbb{Z}$ $c(x) + d(y) = d(x + y - s(x))$

Lemma 4. $\forall x, y \in \mathbb{Z}$ $d(x) + d(y) = d(x + y)$

Lemma 5. $\forall x \in \mathbb{Z}$ $d(-s(x) + x) = c(x)$

Example 4. Example 3 rewritten with Lemmas 2-5:

$$\begin{array}{l} s(a_3b_0 + a_2b_1 + a_1b_2 + a_0b_3 \\ \quad + c(a_2b_0 + a_1b_1 + a_0b_2 \\ \qquad + c(a_1b_0 + a_0b_1))) \end{array}$$
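The rewriting steps of Examples 2-4 can be sanity-checked numerically. The following Python sketch assumes that s(x) = x mod 2 and c(x) = ⌊x/2⌋, which is consistent with Lemmas 1-5 but is stated here as an assumption; the helper names (`bits`, `example2`, `example4`) are hypothetical. It is a brute-force check over small values, not a substitute for the ACL2 proofs.

```python
from fractions import Fraction
from itertools import product

def s(x):
    """Sum bit, assumed to be x mod 2."""
    return x % 2

def c(x):
    """Carry, assumed to be floor(x / 2)."""
    return x // 2

def d(x):
    """Exact division by two, d(x) = x / 2."""
    return Fraction(x, 2)

def lemmas_hold(x, y):
    return (s(s(x) + y) == s(x + y)                      # Lemma 1
            and c(x) + c(y) == d(x + y - s(x) - s(y))    # Lemma 2
            and c(x) + d(y) == d(x + y - s(x))           # Lemma 3
            and d(x) + d(y) == d(x + y)                  # Lemma 4
            and d(-s(x) + x) == c(x))                    # Lemma 5

def bits(v, n=4):
    """Little-endian bit list of v, so bits(v)[i] is bit i."""
    return [(v >> i) & 1 for i in range(n)]

def example2(a, b):
    """Term of Example 2 (adders in s-c form, before rewriting)."""
    A = a[3]*b[0] + a[2]*b[1] + a[1]*b[2]
    B = a[2]*b[0] + a[1]*b[1] + a[0]*b[2]
    C = a[1]*b[0] + a[0]*b[1]
    return s(s(s(A) + a[0]*b[3] + c(B)) + c(s(B) + c(C)))

def example4(a, b):
    """Term of Example 4 (target form for the 4th output bit)."""
    col3 = a[3]*b[0] + a[2]*b[1] + a[1]*b[2] + a[0]*b[3]
    B = a[2]*b[0] + a[1]*b[1] + a[0]*b[2]
    C = a[1]*b[0] + a[0]*b[1]
    return s(col3 + c(B + c(C)))

all_lemmas = all(lemmas_hold(x, y)
                 for x, y in product(range(-8, 9), repeat=2))
all_match = all(
    example2(bits(p), bits(q)) == example4(bits(p), bits(q)) == (((p * q) >> 3) & 1)
    for p in range(16) for q in range(16))
```

Under these assumptions, both `all_lemmas` and `all_match` evaluate to `True`: the two terms agree with each other and with bit 3 of the integer product for all 4-bit operands.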

As Booth encoding can incorporate multiple input bits when generating partial products, we can see operators for logical gates (e.g., logical OR, XOR) when verifying Booth encoded multipliers. We use a few more simple lemmas to simplify terms from Booth encoding and we derive the same final form. These lemmas, along with examples, are provided in our previous work [7], and we omit them here for brevity. These extra lemmas are triggered automatically when Booth encoding is present, and they do not affect other proofs when simple partial products are used.

Once we are done rewriting the left-hand side in Listing 1, we rewrite the right-hand side (specification) to the same form through proved rewrite rules from our library. When we see that the two sides are syntactically equivalent, we conclude that the multiplier is correct.

Note that our target representation has a separate term for each output bit whereas the computer algebra methods specify all output bits with a single expression (see Example 1). This makes it easier for our method to verify designs whose output may be manipulated on bit level such as by truncating, shifting, and bit-masking.

Example 5. The first instance of $a_2b_0$ in Example 2 is replaced by $a_2b_1$ to simulate a bug. Then, the rewriting algorithm returned:

$$\begin{array}{l} s(a_3 b_0 + a_2 b_1 + a_1 b_2 + a_0 b_3 \\ \quad + d(-s(a_2 b_1 + a_1 b_1 + a_0 b_2) \\ \qquad - s(a_2 b_0 + a_1 b_1 + a_0 b_2 + c(a_1 b_0 + a_0 b_1)) \\ \qquad + s(a_2 b_0 + a_1 b_1 + a_0 b_2) \\ \qquad + a_2 b_1 + a_1 b_1 + a_0 b_2 \\ \qquad + c(a_1 b_0 + a_0 b_1))) \end{array}$$

In our previous work, we did not investigate what happens when the design has a bug and whether or not the algorithm can work beyond isolated multipliers. If our program cannot verify a multiplier for some reason, it returns a term rewritten with our lemmas. For example, when we introduce a simple bug to the term in Example 2, the described rewriting algorithm will return the term given in Example 5. The resulting term is larger than the initial term, and the gap can grow even larger for big designs. When a proof attempt fails, either due to a bug in the design or some problem with our verification method, the resulting terms are often very large and users do not receive useful feedback from the program.

A proof attempt might fail even when the target design is correct. We have found such an instance: we could not verify some Booth encoded *merged* multipliers (see Sec. II) larger than 16x16-bit multiplication. Since the resulting terms are so large, we could not understand if there was a missing lemma that could help finish the proofs. We encountered similar issues with some dot-product and MAC designs, and we were likewise unable to verify them.

#### V. IMPROVEMENTS TO S-C-REWRITING

We have developed and experimented with various alternatives to the existing S-C-Rewriting algorithm. Our goal is to verify designs beyond isolated multipliers and return small terms if a proof attempt fails due to a design bug or a problem in the verification system. We have found a rewriting scheme that meets these goals. Instead of rewriting c terms with Lemmas 2-5, we use only the new Lemma 6. Similar to Lemma 1, this lemma extracts the arguments of inner s calls but it also creates a byproduct −c(x).

**Lemma 6.**  $\forall x, y \in \mathbb{Z}$   $c(s(x) + y) = c(x + y) - c(x)$

When the given designs are correct, this lemma helps simplify multiplier designs without needing Lemmas 2-5. We have also seen that when this lemma is used, proofs are actually much faster for Booth encoded designs as well as array multipliers by an order of magnitude (see Sec. VII).

For cases where a proof attempt fails, we apply another lemma (Lemma 7) to cancel out common terms shared between the specification and the design. After all our lemmas are applied and the design is simplified, the rewriter checks whether the simplified design is syntactically equivalent to the specification for each output bit. If they are not, then we rewrite the term that represents the equivalence of these two sides with Lemma 7.

**Lemma 7.**  $\forall x, y \in \{0, 1\}$   $(x = y) \iff (s(x + y) = 0)$

Lemma 6 and Lemma 7 help the program return a much smaller term if a proof attempt fails. Assume that we are rewriting a term that checks the equivalence of the term from Example 2 to its specification (Example 4). When we introduce the same bug from Example 5 to this term, our new rewrite method will return the term in Example 6.

Example 6. When the same bug from Example 5 is rewritten with the improved rewriting algorithm:

$$\begin{array}{l} s(c(a_0b_2 + a_1b_1 + a_2b_0) \\ \quad + c(a_0b_2 + a_1b_1 + a_2b_1)) \\ = 0 \end{array}$$

As seen in this example, the returned term is considerably smaller than what we would get from the older algorithm (Example 5). We have observed the same behavior with larger multipliers, so much so that the returned term can sometimes give a hint as to where the bug exists within the design. Moreover, since these terms are often small, we use the FGL [26] or the GL [27], [28] utilities in ACL2 to send such returned terms to an external SAT solver. We have seen through our experiments (Sec. VII) that a SAT solver can return a counterexample very quickly from simplified terms.
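The relation between the buggy term and the residual of Example 6 can be illustrated by exhaustive enumeration. The sketch below again assumes s(x) = x mod 2 and c(x) = ⌊x/2⌋, and all function names are hypothetical. Plain enumeration stands in for the SAT solver here, which is feasible only because the operands are 4 bits wide.

```python
from itertools import product

def s(x):
    return x % 2          # assumed definition of s

def c(x):
    return x // 2         # assumed definition of c

def bits(v, n=4):
    return [(v >> i) & 1 for i in range(n)]

def column_terms(a, b):
    B = a[2]*b[0] + a[1]*b[1] + a[0]*b[2]
    B_bug = a[2]*b[1] + a[1]*b[1] + a[0]*b[2]   # a2b0 replaced by a2b1
    C = a[1]*b[0] + a[0]*b[1]
    return B, B_bug, C

def spec_bit3(a, b):
    """Specification of the 4th output bit (Example 4)."""
    B, _, C = column_terms(a, b)
    col3 = a[3]*b[0] + a[2]*b[1] + a[1]*b[2] + a[0]*b[3]
    return s(col3 + c(B + c(C)))

def buggy_bit3(a, b):
    """Example 2 with its first a2b0 replaced by a2b1."""
    B, B_bug, C = column_terms(a, b)
    A = a[3]*b[0] + a[2]*b[1] + a[1]*b[2]
    return s(s(s(A) + a[0]*b[3] + c(B_bug)) + c(s(B) + c(C)))

def residual(a, b):
    """Left-hand side of the equation in Example 6."""
    B, B_bug, _ = column_terms(a, b)
    return s(c(B) + c(B_bug))

lemma6_holds = all(c(s(x) + y) == c(x + y) - c(x)
                   for x, y in product(range(-8, 9), repeat=2))

operands = [(p, q) for p in range(16) for q in range(16)]
# The buggy design matches the specification exactly when the residual is 0.
residual_agrees = all(
    (buggy_bit3(bits(p), bits(q)) == spec_bit3(bits(p), bits(q)))
    == (residual(bits(p), bits(q)) == 0)
    for p, q in operands)
counterexamples = [(p, q) for p, q in operands
                   if buggy_bit3(bits(p), bits(q)) != spec_bit3(bits(p), bits(q))]
```

For instance, the operand pair (6, 2) appears in `counterexamples`: the correct product 12 has its 4th bit set, whereas the buggy term evaluates to 0.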

As noted in Sec. IV, proof attempts may fail even when the design is correct. This was the case with our initial term rewriting strategy for some Booth encoded merged multipliers and some MAC and dot-product modules. Since the returned terms are smaller with the modified term-rewriting, we could find the source of the problem and determine the missing lemmas needed to verify these designs. We found out that we simply need to rewrite some c and s instances in terms of logical operators (see Lemmas 8-11) when certain syntactic conditions on their arguments are met. Those conditions are: the arguments x, y and z (if available) need to be instances of the logical AND (∧) function only, and the operands in y and z (if available) need to be a subset of the operands of x. For example, we can apply Lemmas 8-9 if x = a ∧ b ∧ c ∧ d, y = a ∧ c, and z = b ∧ c but we cannot apply them if z = b ∧ e. The resulting terms from these rewrites are simplified the same way as Booth encoding logic. We have these strict syntactical conditions so that the rewriting system is more deterministic and there is minimal effect on the verification procedures for other designs. We leave these lemmas enabled in our program, and they help automatically verify the previously failed designs, such as merged multipliers.

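**Lemma 8.**  $\forall x, y, z \in \{0, 1\}$   $c(x + y + z) = (x \land y) \lor (x \land z) \lor (y \land z)$

**Lemma 9.**  $\forall x, y, z \in \{0, 1\}$   $s(x + y + z) = x \oplus y \oplus z$

**Lemma 10.**  $\forall x, y \in \{0, 1\}$   $c(x + y) = x \land y$

**Lemma 11.**  $\forall x, y \in \{0, 1\}$   $s(x + y) = x \oplus y$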
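Lemmas 8-11 and the syntactic side condition can be illustrated as follows. This is a minimal sketch that assumes s(x) = x mod 2 and c(x) = ⌊x/2⌋, and encodes an AND term as the set of its operand names; that encoding and the name `operands_allow_rewrite` are hypothetical and not taken from our library.

```python
from itertools import product

def s(x):
    return x % 2          # assumed definition of s

def c(x):
    return x // 2         # assumed definition of c

def lemmas_8_to_11_hold():
    """Exhaustively check Lemmas 8-11 over single-bit arguments."""
    for x, y, z in product((0, 1), repeat=3):
        if c(x + y + z) != (x & y) | (x & z) | (y & z):   # Lemma 8
            return False
        if s(x + y + z) != x ^ y ^ z:                     # Lemma 9
            return False
    for x, y in product((0, 1), repeat=2):
        if c(x + y) != x & y:                             # Lemma 10
            return False
        if s(x + y) != x ^ y:                             # Lemma 11
            return False
    return True

def operands_allow_rewrite(x, *others):
    """Side condition: every other argument uses a subset of x's operands.

    Each argument is a frozenset of operand names and stands for the
    conjunction (AND) of those operands.
    """
    return all(term <= x for term in others)

x = frozenset("abcd")                                     # a AND b AND c AND d
ok = operands_allow_rewrite(x, frozenset("ac"), frozenset("bc"))
not_ok = operands_allow_rewrite(x, frozenset("ac"), frozenset("be"))
```

With these inputs, `ok` is `True` and `not_ok` is `False`, matching the example with z = b ∧ c and z = b ∧ e.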

Additionally, we tested this method with another simulation tool, SVTV [24], to show that our method does not have to be used with the SVL system. The SVTV system sources designs from Verilog and flattens them before (symbolic) simulation. We found a way to mark the adder modules before flattening to easily rewrite them in the s-c form. We omit the details here for brevity, and the readers may refer to our online tutorials for details (http://mtemel.com/fmcad21).

#### VI. IMPLEMENTATION

Our entire rewriting system consists of lemmas of the form lhs = rhs. When patterns found in conjectures match lhs, they should be replaced by rhs. Since conjectures for multiplier designs may yield very large terms, we implement a scalable mechanism to find such patterns and apply our lemmas.

We use a verified rewriter [29] that follows an inside-out rewriting strategy [30], [31]. Example 7 shows how a rewrite rule can modify a term from inside out. We can prove the associativity of summation (see the upper-left corner) using the existing libraries and the built-in axioms in ACL2. The defthm event saves the proved lemma as a rewrite rule. When this rewrite rule is in the system, we can apply it to terms whenever the left-hand side pattern finds a match. Assume that this is the only enabled rule, and we would like to prove another conjecture which contains the term shown on the upper-right corner. Since the rewriter performs inside-out rewriting, it will start with the innermost term to search for matching patterns. The first match occurs for the following bindings: a to x3, b to x4, and c to x5. With these term bindings, the term is replaced using the right-hand side of the rewrite rule, and we obtain the term in the lower-left corner. The rule can find another match on this new term. After similarly rewriting this term, we obtain the term in the lower-right corner.

Example 7. A target term is rewritten with a rewrite rule.


Even though the rewriter dives into every subterm, it keeps track of already processed terms and it does not attempt to rewrite them again. For example, assume that x4 in the target term from Example 7 is not a variable but it is a very large term that is already rewritten. After the first rewrite, x4 will have moved within the term. Since the applied rule has a fixed pattern on the left- and right-hand sides, the rewriter knows to not process x4 again. On the other hand, if there was an applicable rule, the new subterm (+ x4 x5) could be rewritten.
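The inside-out strategy can be sketched in a few lines of Python. The sketch assumes that the rule of Example 7 is the associativity rule (+ (+ a b) c) = (+ a (+ b c)), which agrees with the bindings and the new subterm (+ x4 x5) mentioned above. The target term, the tuple encoding of terms, and the function names are hypothetical; the actual rewriter is the verified ACL2 program of [29].

```python
def is_sum(term):
    """A sum is encoded as the tuple ("+", left, right); atoms are strings."""
    return isinstance(term, tuple) and term[0] == "+"

def apply_rule_at_root(left, right):
    """Build (+ left right) and apply (+ (+ a b) c) -> (+ a (+ b c)).

    Both arguments are assumed to be fully rewritten already, so only the
    newly created subterm (+ b c) is examined again; a and c are not
    traversed a second time.
    """
    if is_sum(left):
        _, a, b = left
        return ("+", a, apply_rule_at_root(b, right))
    return ("+", left, right)

def rewrite(term):
    """Inside-out rewriting: rewrite the arguments first, then the root."""
    if not is_sum(term):
        return term
    _, left, right = term
    return apply_rule_at_root(rewrite(left), rewrite(right))

def show(term):
    """Render a term in Lisp-style prefix notation."""
    if not is_sum(term):
        return term
    return "(+ " + show(term[1]) + " " + show(term[2]) + ")"

# Hypothetical target term: (+ (+ x1 x2) (+ (+ x3 x4) x5))
target = ("+", ("+", "x1", "x2"), ("+", ("+", "x3", "x4"), "x5"))
result = rewrite(target)
```

Here the first match binds a to x3, b to x4, and c to x5, and a second match at the root yields `show(result) == "(+ x1 (+ x2 (+ x3 (+ x4 x5))))"`.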

Our overall rewriting system follows this basic rewriting strategy with many more lemmas that work together harmoniously. Fig. 3 shows a flow diagram of how the rewriter processes a conjecture for multiplier designs. Assume that we are using the SVL system for simulation, and the user has already created rewrite rules for adder modules to represent them in the s-c form. When the user states a conjecture for the target multiplier design (see Listing 1) and submits it to ACL2, the rewriter dives into the innermost terms to search for applicable rules. The first subterm that it rewrites is the symbolic simulation instance for the target multiplier design.

The SVL system simulates designs by executing all the functional blocks (e.g., Verilog assignments and submodules) and one by one calculating the values for all internal wires and registers. As the rewriter is symbolically simulating an SVL design, derived expressions for internal wires and registers are tested against rewrite rules. If the rewriter encounters an instantiation of an adder module, then it is replaced by the s and c functions using the rules created by the user. If the rewriter encounters some other module or an assignment, then regular ACL2 expressions representing their functionality are created from their logical definitions.

Fig. 3. Steps taken by the rewriter when rewriting a conjecture for a multiplier design

When new instances of the s and c are created after the adder modules are rewritten, our lemmas for these functions are triggered and our simplification algorithm is applied. For example, when the new term is an instance of c and one of its arguments is an instance of s, then Lemma 6 will be applied. If the arguments of the new s and c instances contain some Boolean expressions, then our lemmas for Booth encoding [7] are applied.

As the symbolic simulation of the circuit finishes, we get a term that is completely rewritten with our algorithm. After that, the system rewrites the right-hand side (specification) to the s-c form with other rewrite rules in our library, compares the two sides syntactically, and exits. If the final term is t, then we can conclude that the multiplier is correct. Otherwise, we can investigate this term and/or send it to a SAT solver so as to generate counterexamples or attempt to finish the proofs.

Note that our lemmas described in Sec. IV, Sec. V, and our previous work [7] do not trigger an expensive rewriting chain upon application. They each have an almost constant time complexity. The slowest component of the rewriting algorithm is lexicographical sorting of the terms in column summations, which are expected to be very small sets as compared to the overall size of the given design. Since our lemmas are applied as the circuit's definition is expanded and we never perform a global search, we observe an almost linear time complexity with respect to the design size as shown in the next section.

#### VII. EXPERIMENTS

We verified various multiplier designs using our tool and applicable tools from related work. We ran our experiments on an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz computer with 32GB system memory. We used three RTL multiplier generators [32]–[34] to generate isolated multipliers, MAC, and dot-product designs. The benchmarks and our tool are available online (http://mtemel.com/fmcad21).

TABLE II PROOF-TIME RESULTS IN SECONDS (ROUNDED) FOR VARIOUS UNTRUNCATED, SIGNED ISOLATED MULTIPLIER DESIGNS

MO: Out of memory (32GB). TO: Time-out (5400 secs./90 mins. for 64x64 and 128x128 multipliers, 16200 secs./270 mins. for the rest).

We verified various architectures with different configurations. For partial product generation algorithms, the designs use either simple partial products (*sp*), Booth encoding radix-4 (*b4*) or radix-2 (*b2*). Summation tree reduction algorithms include counter-based Wallace (*cwt*), array (*ar*), Dadda (*dt*), traditional Wallace (*wt*), overturned-stairs (*os*), balanced delay (*bdt*), redundant binary addition (*rbat*), 4-to-2 compressor (*4:2*), 7-to-3 compressor (*7:3*) trees, and merged multipliers with Dadda tree (*mdt*). For final stage addition, these multipliers implement Kogge-Stone (*ks*), ripple-carry (*rc*), Brent-Kung (*bk*), Han-Carlson (*hc*), Ladner-Fischer (*lf*), carry-select (*csel*), conditional sum (*csu*), variable-length carry-skip (*vcska*), block carry-lookahead (*bcla*) and regular carry-lookahead (*cla*) adders.

As far as we are aware, there are only two other publicly available tools from two different research groups that can verify these complex architectures for isolated multipliers. These are the computer-algebra-based tools RevSCA2 [4] (shortened as RS) and AMulet 2.0 [3], [35] (shortened as AMu). The tools from other studies are not publicly available and/or they do not provide competitive results for the designs in question. RevSCA2 does not produce certificates and it is not verified. AMulet provides certificates to check the validity proofs by external tools; we include the certification time in our results (they can be around 3 times faster without certification). The verification tools from our previous and current work are verified using ACL2; thus, no additional check is required.

TABLE III PROOF-TIME RESULTS IN SECONDS FOR SOME MULTIPLIER DESIGNS IN VARIOUS CONFIGURATIONS

TO: Time-out (5400 secs). NS: Configuration is not supported by the tool. F: Failed proof attempt. The tool returns a large rewritten term.

Table II gives the proof-time results in seconds for signed and untruncated isolated multipliers. Our previous work scales substantially better than RS [4] and AMu [3], but its performance is not as strong for Booth encoded designs. Our improved rewriting algorithm is much faster than our previous work and the other tools, and it can verify even very large Booth encoded multipliers in at most 5 minutes.

Table III gives proof-time results for various architectures and configurations. This includes truncated or right-shifted outputs, merged multipliers, multipliers with different operand sizes, two-point dot-product designs with accumulate, and truncated or untruncated MAC modules. The designs in this table are produced with two different generators [32], [33]. AMulet has a hard-coded specification and does not support many of these configurations. Users can determine the design specifications for our previous work, but our older tool cannot prove some merged multipliers, dot-product, and MAC designs. On the other hand, our new method could verify all of them very quickly.

Table IV shows how the proof-time performance of our tool scales on dot-product designs with different sizes. Even though it is not shown here, allocated system memory scales similarly. Finally, Table V shows the proof-time results for our example module integrated multipliers (see Sec. II-C) for both the SVL and SVTV simulation systems.

TABLE IV OUR TOOL'S PROOF-TIME RESULTS IN SECONDS FOR SIGNED MAC AND DOT-PRODUCT DESIGNS

All designs use Booth radix-4 encoding, Dadda tree and Ladner-Fischer adder.

TABLE V OUR TOOL'S PROOF-TIME RESULTS IN SECONDS FOR OUR EXAMPLE MODULE, INTEGRATED MULTIPLIERS, DESCRIBED IN SEC. II-C

In addition to the designs reported here, we have also verified some private industrial designs at Centaur Technology with a similar performance. These designs include multiply-accumulate, dot-product, multiplication of signed and unsigned numbers, truncation, right-shifting, rounding, and saturation. Our program is not designed to handle branches implemented for saturation. Therefore, after our program simplified the saturated designs, we sent the resulting terms to a SAT solver (*glucose* [36]) with the FGL utility [26], [37], and we have seen that proofs finished successfully in a few seconds.

We have also tried our tool on buggy designs and used a SAT solver (*glucose* [36]) to create counterexamples from simplified terms. We randomly inserted (one or more) bugs into various 64x64-bit, 128x128-bit, and 256x256-bit designs and experimented with 20 different scenarios. Our tool rewrote each multiplier design and returned simplified terms within the same amount of time as given in Table II. It took the SAT solver between 0.1 and 10 seconds to return a counterexample from rewritten terms. Our previous tool could not be used in this workflow because it returns massive terms when proof attempts fail (see Sec. IV). Using the SAT solver with the original conjecture (in other words, without rewriting with our tool) could give a counterexample in some cases after a few minutes, but it timed out (60 minutes) in the majority of cases. Additionally, our tool can tell exactly which output bits mismatch the specification. With our new method, we see that our term-rewriting strategy can be very practical and efficient for debugging flawed designs.

#### VIII. CONCLUSION

We have presented a term-rewriting method that can be used to verify digital circuit designs with embedded integer multipliers. Our tool is efficient, automated, and provably correct. We have shown that we can verify isolated multipliers as large as 1024x1024-bit in less than 5 minutes. Our system allows the user to modify the specification per the target design. Therefore, we can verify multipliers with unusual operand sizes, whose output may be truncated, right-shifted, rounded or saturated. In addition, we can verify other multiplier-centric arithmetic operations such as dot-product and multiply-accumulate. Our library and tutorials are distributed with the ACL2 system, and this content is available online for public use (http://mtemel.com/fmcad21).

This work has been a continuation of our earlier study [7]. With the improvements detailed in this paper, we can verify Booth encoded designs with a much better proof-time efficiency, along with MAC, dot-product, and merged multiplier designs. In addition, we can now generate counterexamples for buggy designs. Moreover, we provide a more comprehensive summary of various multiplier design techniques and discuss why they might be challenging for verification tools.

We use the ACL2 programming language and interactive theorem prover to run and verify our multiplier verification tool, and we use the SVL semantics as our preferred method to simulate Verilog designs. However, our term rewriting algorithm does not require any specific feature from a particular theorem prover or anything unique to the SVL system. Using a term rewriter and a simulator with hierarchical reasoning can be enough to implement our algorithm on any platform.

We have exploited design hierarchy when implementing our algorithm, whereas the other state-of-the-art tools [3], [4] work on flattened designs. We should note that these tools more or less depend on the original design having clear boundaries for adder modules for their good proof-time performance in the majority of cases. Our choice to use a symbolic simulation system that allows hierarchical reasoning reduces engineering costs and simplifies our program. This way, we do not need to implement any detection algorithm for adder logic. If necessary, using our term-rewriting algorithm for flattened designs might be possible by implementing some preprocessing techniques to reconstruct the design hierarchy. On the other hand, incorporating hierarchical reasoning into computer algebra methods may help improve their performance.

We continue to exercise and improve our method with ever more complex designs such as floating-point multiplication. We have laid the groundwork to permit verification procedures with improved automation and efficiency. The convenience that comes with our fast and automatic verification process can contribute to building reliable hardware systems that include embedded integer multipliers of varying sizes, including but not limited to general-purpose processing units, image processors, digital signal processors, and secure cryptoprocessors.

#### REFERENCES


*Models for Codesign (MEMOCODE)*. Cambridge, UK: IEEE/ACM, July 2011, pp. 89–97.


Formal Methods in Computer-Aided Design 2021

IC3 with Internal Signals

Rohit Dureja, IBM

Arie Gurfinkel, University of Waterloo

Alexander Ivrii, IBM

Yakir Vizel, The Technion

*Abstract*—IC3 is a highly-effective algorithm for formal hardware verification. It cleverly uses a SAT solver to compute an inductive invariant, an over-approximation of reachable states, of a hardware design. The invariant is computed in CNF as a conjunction of lemmas. This CNF representation over state variables, although efficient, leads to an obvious deficiency: IC3 is not effective for designs that do not have a concise CNF invariant over state variables. We show how to remedy this deficiency by extending traditional IC3 to learn invariants not only in terms of state variables, but also in terms of internal signals of the design. Our proposed method can learn significantly more compact invariants than IC3, while maintaining a highly-efficient CNF representation. We evaluate our technique on several industrial sequential equivalence checking (SEC) problems from IBM, SEC problems derived from designs in the Hardware Model Checking Competition (HWMCC) and SEC problems from academia. In addition, we evaluate it on HWMCC benchmarks. IC3 with internal signals is efficient for SEC and outperforms traditional IC3 on an important class of benchmarks.

#### I. INTRODUCTION

IC3 [1], [2] is a powerful algorithm for formal hardware verification, and is the primary model-checking engine in various state-of-the-art formal verification tools. IC3, and its several variants [3], is especially useful for establishing system safety (i.e., discovering an inductive invariant). Whenever IC3 succeeds in proving safety, it finds an inductive invariant justifying the property. Traditionally, such an invariant is a conjunction of lemmas represented in CNF, each lemma is a disjunction of literals, and each literal is either a state variable or its negation. Conversely, IC3 does not succeed in proving a property when it is unable to find such an inductive invariant within the specified verification-resource limits. This can happen for one of two reasons: (i) a small inductive invariant exists but IC3 is unable to find it, or (ii) a small inductive invariant does not exist. It is difficult to determine which of these two cases is responsible for IC3 failing to prove a property. Most research on improving IC3 (e.g., [4]–[6]) focuses on quickly finding the inductive invariant. However, finding the inductive invariant quickly can only help if a (reasonably) small invariant exists in the first place.

A known Achilles heel of IC3 is the class of model-checking problems for which any inductive invariant (over state variables) is necessarily exponential in size. For example, let $x_1, \ldots, x_n$ be state variables, and suppose that the set of reachable states is characterized by $\{x_1, \ldots, x_n \mid x_1 \oplus \cdots \oplus x_n = 1\}$, while the set of bad states is characterized by $\{x_1, \ldots, x_n \mid x_1 \oplus \cdots \oplus x_n = 0\}$. In this case the (only) inductive invariant is exponential in size and contains $2^{n-1}$ clauses that correspond to representing $x_1 \oplus \cdots \oplus x_n = 1$ in CNF. With $n = 3$, the inductive invariant contains four clauses: $(\neg x_1 \lor \neg x_2 \lor x_3) \land (\neg x_1 \lor x_2 \lor \neg x_3) \land (x_1 \lor \neg x_2 \lor \neg x_3) \land (x_1 \lor x_2 \lor x_3)$. A possible work-around is to extend the design with additional signals that are necessary to concisely represent an invariant. In this example, IC3 extended with a lemma over $z = x_1 \oplus \cdots \oplus x_n$ can find a tiny inductive invariant consisting of only a single unit-clause lemma: $(z = 1)$.
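The clause count in this example can be reproduced with a short Python sketch. It builds the standard clause set for $x_1 \oplus \cdots \oplus x_n = 1$ by blocking every even-parity assignment; the function names and the signed-integer encoding of literals (`+i` for $x_i$, `-i` for $\neg x_i$) are assumptions made for illustration. The sketch constructs this CNF and checks it, but it does not prove that no smaller CNF exists.

```python
from itertools import product

def parity_cnf(n):
    """Clauses of a CNF for x_1 xor ... xor x_n = 1.

    Each even-parity assignment falsifies the formula, so it is blocked by
    one clause that contains the negation of every literal it sets.
    """
    clauses = []
    for assignment in product((0, 1), repeat=n):
        if sum(assignment) % 2 == 0:
            clauses.append(frozenset(
                -(i + 1) if bit else (i + 1)
                for i, bit in enumerate(assignment)))
    return clauses

def satisfies(clauses, assignment):
    """True when the assignment satisfies every clause."""
    return all(
        any((lit > 0) == bool(assignment[abs(lit) - 1]) for lit in clause)
        for clause in clauses)

three = set(parity_cnf(3))
sizes = {n: len(parity_cnf(n)) for n in range(1, 9)}
```

For `n = 3` the result contains exactly the four clauses listed above, and `sizes[n]` equals `2 ** (n - 1)` for every `n` in the range.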

This leads to the question of which additional signals to consider. A possible solution is to consider variables that represent logic gates in the transition relation of the system model. We refer to these as *internal nets* or *innards*. Prior work [7] uses innards to extend ternary valued simulation of counterexamples to induction in IC3, which enables a succinct description of the set of states that IC3 must eventually block. In this paper, we propose an approach based on learning lemmas directly over innards that improves the performance of IC3 in establishing safety by finding more concise inductive invariants. Our method of learning lemmas over internal nets can be viewed as a form of inductive generalization. A lemma is first generalized as usual, and then literals corresponding to latches are replaced by internal nets. Specifically, whenever IC3 learns a lemma $C$ over state variables, it also tries to learn an additional lemma $C_2$ over state variables and internal signals. To this end, we first extend $C$ to a lemma $C_1$ that is logically equivalent to $C$ but contains the literals of $C$ and (certain) internal nets. We obtain $C_2$ by inductively generalizing $C_1$, while guiding the inductive generalization to remove state variables. It is guaranteed that $C_2$ is stronger than $C$. Therefore, $C_2$ blocks the same states (and maybe more) as $C$. We then add lemma $C_2$ to IC3's inductive trace, so that it can be used for predecessor queries and convergence checks. A major advantage of our approach is that it can be easily integrated with any existing mature IC3 implementation.

Our work is motivated by a challenging set of microprocessor verification problems that arise from the Aspect-Oriented Design (AOD) methodology used at IBM. The verification problem checks sequential equivalence of an original design against a new version of the design with added aspects (e.g., clock-gating, logging, or debug interfaces). The complex verification challenge is broken into many sub-tasks using a combination of the usual sequential equivalence checking (SEC) approaches, including k-induction, speculative reduction, and localization [8]–[11]. Verification sub-tasks that are not solved by these techniques are then checked using Interpolation-based Model Checking (IMC) or IC3. Traditional IC3 scales very poorly for these verification problems. On the other hand, IMC works rather well but is not stable: small changes in the design negatively impact verification times. The proposed IC3 algorithm with internal signals significantly outperforms both IMC and traditional IC3.

The proprietary nature of IBM AOD verification problems prohibits detailed public disclosure. Nevertheless, we apply the IBM AOD sequential equivalence checking flow on two selected benchmarks from the Hardware Model Checking Competition (HWMCC) to validate equivalence between the original design and its retimed [12] versions. Each such equivalence-check generates hundreds of verification problems of which some are solved by k-induction, but a significant number remain unsolved. We note that IC3 with internal signals is more effective than traditional IC3 in solving the remaining equivalences for both SEC problems. We also apply our algorithm on a small set of publicly available SEC benchmarks [13] from academia, and note that our proposed algorithm is able to solve a higher number of equivalences compared to traditional IC3. This suggests that using internal nets in IC3 is especially effective for difficult sequential equivalence checking problems.

To further validate the efficacy of IC3 with internal signals, we apply the proposed algorithm to a variety of single-property benchmarks from HWMCC. However, the technique does not show a significant improvement unlike our experience with IBM AOD and other benchmarks. There are a few HWMCC benchmarks that are solved significantly faster and some that are uniquely solved by our algorithm, but overall, traditional IC3 is superior. Interestingly, the number of designs where the new technique succeeds increases in the latest competition editions that are based on word-level designs. This points to a deficiency of any benchmark set – the distribution of problems in the set does not necessarily correspond to their distribution in practice. Techniques that perform well on only a few benchmarks in the set, might actually be very effective in some practical application!

The rest of the paper is organized as follows. Section II provides the necessary background. Section III describes motivating examples to highlight the core deficiency of IC3 addressed by our approach. Section IV describes the IC3 algorithm with internal signals, while Section V reports on our experimental evaluation. Section VI discusses related and future work, and Section VII concludes.

#### II. BACKGROUND

#### *A. Safety Verification Problem*

We represent a finite state transition system $S$ as a tuple $\langle i, x, Init(x), Tr(i, x, x') \rangle$, which consists of primary inputs $i$, state variables $x$, predicate $Init(x)$ defining the initial states, and predicate $Tr(i, x, x')$ defining the transition relation. Next-state variables are denoted as $x'$. We assume that $Tr$ is represented as a *netlist*, that is, a directed acyclic graph with nodes corresponding to logic gates. Given the values of $x$ and $i$, the values of $x'$ may thus be uniquely computed by (constant) propagation, i.e., using Boolean or three-valued simulation. We say that a *net* is either an input, a state variable or a logic gate. We refer to state variables and their negations as *latches*, and to internal logic gates and their negations as *innards*. We say that an innard is *input-free* if it does not have any inputs in its combinational cone-of-influence.

A *clause* is a disjunction of literals, where each literal is either a net or its negation. We say that a clause is *over latches* to emphasize all the literals in the clause are latches. A Boolean formula in *Conjunctive Normal Form (CNF)* is a conjunction of clauses. A *cube* is a conjunction of literals. A Boolean formula in *Disjunctive Normal Form (DNF)* is a disjunction of cubes. It is often convenient to treat a clause or a cube as a set of literals, a CNF as a set of clauses, and DNF as a set of cubes. For example, given a CNF formula $F$, a clause $c$ and a literal $\ell$, we write $\ell \in c$ to mean that $\ell$ occurs in $c$, and $c \in F$ to mean that $c$ occurs in $F$.

A *trace* is a sequence of Boolean valuations to the nets, starting with an initial state satisfying $Init$ and with successive time-step valuations consistent with $Tr$. *Reachable states*, denoted by $Reach$, are states that can be reached on a trace. Let $Bad(x)$ be a predicate defining *bad* (or *unsafe*) states. The *safety verification problem* consists of checking whether $Reach \Rightarrow \neg Bad$, that is, either finding a trace that leads to a state in $Bad$ or showing that such a trace does not exist.

### *B. Traditional IC3*

We give a very brief and high-level description of IC3, concentrating on the components that are relevant for this work. This description includes the classical IC3 algorithm [1], [2], and some of its variants such as [6]. In what follows, we refer to all these algorithms simply as IC3.

IC3 proves safety by finding a formula Inv(x), called *a safe inductive invariant*, that satisfies the following conditions:

$$Init(x) \Rightarrow Inv(x) \tag{1}$$

$$(Inv(x) \land \exists i \cdot Tr(i, x, x')) \Rightarrow Inv(x') \tag{2}$$

$$Inv(x) \Rightarrow \neg Bad(x) \tag{3}$$

The computed formula Inv(x) is in CNF over latches. Internally, IC3 maintains sets of clauses F<sub>0</sub>, F<sub>1</sub>, … called an *inductive trace*. Each F<sub>k</sub> in a trace is called a *frame*, each clause c ∈ F<sub>k</sub> is called a *lemma*, and the index of a frame is called a *level*. We assume that F<sub>0</sub> is initialized to Init and that Init ⇒ ¬Bad. IC3 maintains the following invariant:

$$F_0 = Init \qquad F_{k+1} \subseteq F_k \qquad F_k \land Tr \Rightarrow F'_{k+1}$$

Note that the inductive trace maintained by IC3 is syntactically monotone, and each F<sub>k+1</sub> is inductive relative to F<sub>k</sub>. Let Reach<sup>≤k</sup> denote the set of states reachable from Init in k steps or less. It holds that Reach<sup>≤k</sup> ⇒ F<sub>k</sub>, i.e., F<sub>k</sub> is an over-approximation of the states reachable in k steps or less.

Additionally, IC3 maintains a queue of *proof obligations* (or *CTIs*) of the form ⟨m, k⟩, where m is a cube over latches and k > 0 is a *level*. At each point of the execution, it considers a proof obligation ⟨m, k⟩ and makes an *initial* query SAT?(Init ∧ m) that checks whether a state in m is an initial state, and a *predecessor* query SAT?(¬m ∧ F<sub>k−1</sub> ∧ Tr ∧ m′) that checks whether a state in m can be reached from a state in F<sub>k−1</sub>. If both queries are unsatisfiable, IC3 can add the lemma ¬m to all F<sub>j</sub>, for j ≤ k, refining the inductive trace. However, for performance it is crucial to *inductively generalize* ¬m first, finding a lemma ϕ ⊆ ¬m that also satisfies Init ⇒ ϕ and ϕ ∧ F<sub>k−1</sub> ∧ Tr ⇒ ϕ′ (some IC3 variants such as Quip also keep an under-approximation of Reach and modify Init to include this under-approximation). The inductive generalization is typically done by removing literals from ¬m while the two conditions remain satisfied. We refer the reader to [3] for more details.

IC3 periodically *pushes* all lemmas, by checking if a lemma ϕ ∈ F<sub>k</sub> \ F<sub>k+1</sub> can be added to F<sub>k+1</sub> as well. If at any point F<sub>k</sub> = F<sub>k+1</sub> and F<sub>k</sub> ⇒ ¬Bad, then we can take Inv = F<sub>k</sub> as the safe inductive invariant.
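To make conditions (1)–(3) concrete, the sketch below checks them by exhaustive enumeration on a tiny hypothetical system with two latches `a`, `b` and one input `i`. The system (`init`, `step`, `bad`) is invented for illustration only; IC3 itself discharges these conditions with SAT queries rather than enumeration.

```python
from itertools import product

# Hypothetical 2-latch system: Init = (not a and not b),
# a' = b, b' = i and not a and not b, Bad = (a and b).
def init(a, b):
    return (not a) and (not b)


def step(i, a, b):
    """Next-state valuation (a', b') for input i."""
    return (b, i and (not a) and (not b))


def bad(a, b):
    return a and b


def is_safe_inductive_invariant(inv):
    """Check conditions (1), (2), (3) for a predicate inv over (a, b)."""
    states = list(product([False, True], repeat=2))
    cond1 = all(inv(*s) for s in states if init(*s))          # Init => Inv
    cond2 = all(inv(*step(i, *s))                              # Inv and Tr => Inv'
                for s in states if inv(*s)
                for i in (False, True))
    cond3 = all(not bad(*s) for s in states if inv(*s))        # Inv => not Bad
    return cond1 and cond2 and cond3
```

On this system the clause (¬a ∨ ¬b) passes all three checks, whereas the stronger clause (¬a) satisfies (1) and (3) but is not closed under the transition relation.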

#### III. MOTIVATING EXAMPLES

In this section, we motivate our work with several examples. Each is a series of problems such that inductive invariants in CNF over latches grow exponentially, while the corresponding inductive invariants over latches and innards grow linearly. The examples are sketched briefly here; we provide full details with AIGER and source files in the companion repository.<sup>1</sup> Note that the examples are distilled to their essence. For some, the property itself is inductive. Thus, traditional IC3 that learns invariants over latches *and* the property is able to solve them. However, the illustrated problems remain when the examples are parts of a larger design, and the property is more complex and is no longer inductive on its own.

Example 1 (Parity) Let x<sub>1</sub>, …, x<sub>n</sub> be the latches. The set of reachable states is characterized by {x<sub>1</sub>, …, x<sub>n</sub> | x<sub>1</sub> ⊕ ··· ⊕ x<sub>n</sub> = 1}. The set of bad states is characterized by {x<sub>1</sub>, …, x<sub>n</sub> | x<sub>1</sub> ⊕ ··· ⊕ x<sub>n</sub> = 0}. Note that the only safe inductive invariant over latches has 2<sup>n−1</sup> clauses representing x<sub>1</sub> ⊕ ··· ⊕ x<sub>n</sub> = 1 in CNF. Yet, there is a safe inductive invariant consisting of a single lemma, (z = 1), for the innard z = x<sub>1</sub> ⊕ ··· ⊕ x<sub>n</sub>. ✷
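The clause count in Example 1 can be reproduced with a short script. This is a sketch under the assumption that we use the canonical encoding with one blocking clause per even-parity assignment; the function names and the signed-integer literal convention are chosen for this illustration.

```python
from itertools import product


def parity_cnf(n):
    """Clauses over x1..xn equivalent to x1 xor ... xor xn = 1.

    A clause is a tuple of signed integers: +k stands for xk, -k for (not xk).
    Each even-parity assignment is excluded by exactly one clause.
    """
    clauses = []
    for bits in product([0, 1], repeat=n):
        if sum(bits) % 2 == 0:
            # The clause is false exactly on this assignment.
            clauses.append(tuple(-(k + 1) if b else (k + 1)
                                 for k, b in enumerate(bits)))
    return clauses


def satisfies(clauses, bits):
    """True if the assignment bits (tuple of 0/1) satisfies every clause."""
    return all(any((lit > 0) == bool(bits[abs(lit) - 1]) for lit in clause)
               for clause in clauses)
```

For each n, `parity_cnf(n)` contains 2<sup>n−1</sup> clauses of n literals each, while the lemma over the innard z is a single unit clause.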

Example 2 (from [14]) Consider two counters that count modulo 2<sup>n</sup>, whose state bits are s = (s<sub>0</sub>, …, s<sub>n−1</sub>) and t = (t<sub>0</sub>, …, t<sub>n−1</sub>), respectively. Let i be an input. When i = 0 both counters keep their values; when i = 1 both counters increment their values by one modulo 2<sup>n</sup>. Suppose that the initial state is {s ≠ t}, and the bad state is {s = t}. The work [14] argues that any safe inductive invariant for the usual IC3 must contain at least 2<sup>n</sup> lemmas. Furthermore, there is a much smaller safe inductive invariant for the *Reverse IC3* that consists of 2n lemmas required to represent s = t in CNF. With innards, there is an inductive invariant consisting of a single lemma, (z = 1), for the innard z = (s ≠ t). ✷

Example 3 (SEC) This example illustrates a sequential equivalence checking problem between an original and a retimed [12] design. Let the "original part" of the design consist of latches x<sub>1</sub>, …, x<sub>n</sub> and inputs i<sub>1</sub>, …, i<sub>n</sub>, such that init(x<sub>k</sub>) = 0 and next(x<sub>k</sub>) = i<sub>k</sub> for k = 1, …, n, and a net z = x<sub>1</sub> ⊕ ··· ⊕ x<sub>n</sub>. Let the "retimed part" of the design consist of a net u = i<sub>1</sub> ⊕ ··· ⊕ i<sub>n</sub> and a latch v with init(v) = 0 and next(v) = u. Let the bad state be {z ≠ v}. The only safe inductive invariant is v ↔ (x<sub>1</sub> ⊕ ··· ⊕ x<sub>n</sub>), which consists of 2<sup>n</sup> lemmas in CNF. With innards, an alternative invariant requires only two lemmas: v → z and z → v. ✷

Example 4 This example is motivated by the benchmark rast-p16 from HWMCC'20. The design contains latches x<sub>1</sub>, …, x<sub>n</sub> and y<sub>1</sub>, …, y<sub>n</sub>, and innards z<sub>1</sub> = x<sub>1</sub> ∧ y<sub>1</sub>, …, z<sub>n</sub> = x<sub>n</sub> ∧ y<sub>n</sub>. Assume that the lemma C = (z<sub>1</sub> ∨ ··· ∨ z<sub>n</sub>) over innards is inductive. Representing C in CNF over latches requires 2<sup>n</sup> lemmas. For example, for n = 3, the lemma (z<sub>1</sub> ∨ z<sub>2</sub> ∨ z<sub>3</sub>) is equivalent to 8 lemmas (x<sub>1</sub> ∨ x<sub>2</sub> ∨ x<sub>3</sub>), (x<sub>1</sub> ∨ x<sub>2</sub> ∨ y<sub>3</sub>), (x<sub>1</sub> ∨ y<sub>2</sub> ∨ x<sub>3</sub>), (x<sub>1</sub> ∨ y<sub>2</sub> ∨ y<sub>3</sub>), (y<sub>1</sub> ∨ x<sub>2</sub> ∨ x<sub>3</sub>), (y<sub>1</sub> ∨ x<sub>2</sub> ∨ y<sub>3</sub>), (y<sub>1</sub> ∨ y<sub>2</sub> ∨ x<sub>3</sub>), (y<sub>1</sub> ∨ y<sub>2</sub> ∨ y<sub>3</sub>). ✷
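The expansion in Example 4 is an instance of distributing a disjunction of conjunctions into CNF. The following sketch (with illustrative function names and string-valued literals, both assumptions of this example) generates the clauses over latches and verifies the equivalence by enumeration for small n.

```python
from itertools import product


def expand_over_latches(n):
    """Clauses over latches equivalent to (z1 or ... or zn), zk = xk and yk.

    Each clause picks either xk or yk for every position k.
    """
    return [tuple(f"{v}{k + 1}" for k, v in enumerate(choice))
            for choice in product("xy", repeat=n)]


def equivalent(n):
    """Brute-force check that the expansion matches the lemma over innards."""
    clauses = expand_over_latches(n)
    for bits in product([False, True], repeat=2 * n):
        val = {f"x{k + 1}": bits[k] for k in range(n)}
        val.update({f"y{k + 1}": bits[n + k] for k in range(n)})
        over_innards = any(val[f"x{k + 1}"] and val[f"y{k + 1}"]
                           for k in range(n))
        over_latches = all(any(val[lit] for lit in clause)
                           for clause in clauses)
        if over_innards != over_latches:
            return False
    return True
```

For n = 3 the function returns the eight clauses in the same order as listed in the example.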

#### IV. FINDING LEMMAS OVER INNARDS

In this section, we provide an overview of our approach (Sec. IV-A), followed by an algorithm for extending IC3 lemmas with innards (Sec. IV-B), and finally an algorithm for inductive generalization in the presence of innards (Sec. IV-C).

#### *A. The overall approach*

Traditional IC3 learns lemmas by inductively generalizing negations of blocked proof obligations. Both proof obligations and lemmas are over latches. These lemmas are then added to IC3's inductive trace and used in future predecessor and convergence checks. In our approach, proof obligations are also over latches (exactly as in traditional IC3); however, we extend lemma learning to both latches and innards. Our results apply to arbitrary innards, but for simplicity of presentation in the rest of the paper, we restrict to input-free innards, calling them simply innards. Note that unlike [7], our restriction is for presentation only. Throughout the section, we use the following running example.

Example 5 Let w, x, y, z be latches and i be an input. Let

$$\begin{aligned} Init &\triangleq w \land x \land y \land z \\ Tr &\triangleq (w' = \neg w) \land (x' = w) \land (y' = w) \land \\ &(g = x \land y) \land (h = g \land i) \land (z' = h) \end{aligned}$$

This design has two gates: g = x ∧ y and h = g ∧ i, where g is input-free and h depends on the input i. Hence, the set of (input-free) innards is {g}. ✷

We extend IC3 to reason about innards in the initial state and the next state. To this end, let Tr<sub>inn</sub> be the part of the transition relation that defines innards, and let $\widehat{Init} \triangleq Init \land Tr_{inn}$ and $\widehat{Tr} \triangleq Tr \land Tr'_{inn}$. In Example 5,

$$\begin{aligned} Tr_{inn} &= (g = x \land y) \qquad \widehat{Init} = Init \land (g = x \land y) \\ \widehat{Tr} &= Tr \land (g' = x' \land y') \end{aligned}$$

where g′ is a copy of g in "the next state". The following definition extends relative induction [1] to lemmas over latches and innards.

<sup>1</sup>https://github.com/agurfinkel/innard-benchmarks.

Input: Frame k, lemma C over latches, s.t. C is inductive relative to F<sub>k</sub>

Output: Lemma C<sub>2</sub> over latches and innards, s.t. C<sub>2</sub> is inductive relative to F<sub>k</sub>

1. C<sub>1</sub> ← ExtendLemma(C)
2. C<sub>2</sub> ← InductivelyGeneralize(k, C<sub>1</sub>)
3. return C<sub>2</sub>

Fig. 1. Procedure LearnAdditionalLemma.

Definition 1 A lemma C over latches and innards is inductive relative to a set of lemmas G *iff* (i) $\widehat{Init} \Rightarrow C$, and (ii) $G \land \widehat{Tr} \land C \Rightarrow C'$.

Def. 1 generalizes the original definition: if a lemma C over latches is relatively inductive in the original sense of [1], then C is also relatively inductive by Def. 1. In what follows, by *relatively inductive*, we always mean Def. 1. Continuing our running example, let C = (w ∨ x) (note that C is over latches), and C<sub>1</sub> = (w ∨ x ∨ g) (note that C<sub>1</sub> is over latches and innards). Then, both C and C<sub>1</sub> are inductive relative to G = ⊤. Note that $\widehat{Init} \Rightarrow C$, $\top \land \widehat{Tr} \land C \Rightarrow C'$, $\widehat{Init} \Rightarrow C_1$, and $\top \land \widehat{Tr} \land C_1 \Rightarrow C_1'$ hold.
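The claims about the running example can be checked by enumerating all states of Example 5. The sketch below is a brute-force rendering of Def. 1, assuming clauses with positive literals only and using names (`with_innards`, `relatively_inductive`, and so on) invented for this illustration; an actual IC3 implementation would use SAT queries instead.

```python
from itertools import product

LATCHES = ("w", "x", "y", "z")


def with_innards(latches):
    """Add the input-free innard g = x AND y to a latch valuation."""
    state = dict(latches)
    state["g"] = state["x"] and state["y"]
    return state


def all_states():
    for bits in product([False, True], repeat=len(LATCHES)):
        yield with_innards(dict(zip(LATCHES, bits)))


def is_init(s):
    return s["w"] and s["x"] and s["y"] and s["z"]


def successors(s):
    """Both successors of s (input i = 0 and i = 1) under Tr of Example 5."""
    for i in (False, True):
        h = s["g"] and i
        yield with_innards({"w": not s["w"], "x": s["w"],
                            "y": s["w"], "z": h})


def holds(clause, s):
    """Clauses contain positive literals only, given as net names."""
    return any(s[name] for name in clause)


def relatively_inductive(clause, frame=()):
    """Def. 1 by enumeration; frame=() plays the role of G = true."""
    init_ok = all(holds(clause, s) for s in all_states() if is_init(s))
    step_ok = all(
        holds(clause, t)
        for s in all_states()
        if holds(clause, s) and all(holds(c, s) for c in frame)
        for t in successors(s)
    )
    return init_ok and step_ok
```

Under this model, (w ∨ x) and (w ∨ x ∨ g) are both relatively inductive with respect to the empty frame, in agreement with the text, while a clause such as (x ∨ g) is not.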

The following lemma shows that using relatively inductive (in the sense of Def. 1) lemmas in IC3 is sound.

Lemma 1 (Soundness) *For any lemma* C *over latches and innards, if* $\widehat{Init} \Rightarrow C$ *and* $F_k \land \widehat{Tr} \land C \Rightarrow C'$ *hold, then* C *includes* Reach<sup>≤k+1</sup> *(all the states reachable in up to* k + 1 *steps from* Init*). In particular,* C *can be added to IC3's inductive trace up to the frame* k + 1*.*

Our approach of learning lemmas over innards is a form of inductive generalization. Each time that IC3 blocks a proof obligation and learns a (relatively inductive) lemma over latches, we generalize it into an (additional) lemma over latches and innards. The overall algorithm LearnAdditionalLemma is shown in Fig. 1. We give a high-level overview of LearnAdditionalLemma, while the details of key functions are described in later sections. The approach consists of two steps:

*Step 1:* The procedure ExtendLemma extends lemma C (over latches) to a lemma C<sub>1</sub> = C ∨ C′ (over latches and innards, where C′ here denotes the added disjunction of innards) such that Tr<sub>inn</sub> ⇒ (C ⇔ C<sub>1</sub>), i.e., C and C<sub>1</sub> are equivalent modulo Tr<sub>inn</sub>. The details are in Section IV-B. For instance, in our example lemmas C = (w ∨ x) and C<sub>1</sub> = (w ∨ x ∨ g) are equivalent, given that g = x ∧ y. Indeed, modulo Tr<sub>inn</sub>: (w ∨ x ∨ g) ≡ (w ∨ x ∨ (x ∧ y)) ≡ (w ∨ x). It also follows (see Corollary 1) that C<sub>1</sub> remains relatively inductive.

*Step 2:* The procedure InductivelyGeneralize inductively generalizes C<sub>1</sub> by removing literals, while prioritizing removal of latches (the original literals of C), and more generally trying to leave only the "interesting" innards. The details are in Section IV-C. In our example, lemma C<sub>1</sub> = (w ∨ x ∨ g) can be generalized to C<sub>2</sub> = (w ∨ g).

By construction, it follows that C<sub>2</sub> remains inductive relative to F<sub>k</sub>. Moreover, as Tr<sub>inn</sub> ⇒ (C ⇔ C<sub>1</sub>) and C<sub>2</sub> ⇒ C<sub>1</sub>, C<sub>2</sub> is potentially stronger than the original lemma C (but the converse might not hold). In our example, C<sub>2</sub> = (w ∨ g) is equivalent to (w ∨ (x ∧ y)) = (w ∨ x) ∧ (w ∨ y), i.e., the lemma C<sub>2</sub> over latches and innards represents two different lemmas over latches only. It is also interesting to note that while the original lemma C was over latches {w, x}, the "additional" lemma (w ∨ y) is over a different set of latches {w, y}.

Whenever ExtendLemma does not add any innards to C, the procedure LearnAdditionalLemma stops immediately, without calling InductivelyGeneralize. However, note that even when ExtendLemma adds new literals, it is possible that InductivelyGeneralize removes them, resulting in the original lemma C! When LearnAdditionalLemma returns a lemma C<sub>2</sub> that is different from C, C<sub>2</sub> is also added to IC3's inductive trace (up to frame F<sub>k+1</sub>), and hence is also used in future predecessor and pushing queries.

#### *B. Extending lemmas with innards*

The procedure ExtendLemma receives a lemma C over latches as input and returns a lemma C<sub>1</sub> over latches *and* innards as output. It iteratively finds innards z such that Tr<sub>inn</sub> ⇒ (z ⇒ C) and replaces C with C ∨ z. It works as follows: instead of searching for an innard z that implies C, it searches for all innards ¬z that are implied by ¬C and takes their negations. Specifically, given a lemma C = (c<sub>1</sub> ∨ ··· ∨ c<sub>m</sub>), we set each c<sub>i</sub> ∈ C to 0 and find which innards are implied by constant propagation in the Tr<sub>inn</sub> part of the netlist. The algorithm for constant propagation in a netlist is standard and is not presented here.

Going back to our running example, given a lemma C = (w ∨ x), we are looking for innards implied by the partial assignment (w = 0) ∧ (x = 0). Since g = x ∧ y, by propagation we obtain that g = 0. Thus, modulo Tr<sub>inn</sub>, g ⇒ C, and hence C is equivalent to (C ∨ g) = (w ∨ x ∨ g). Note that if we did not restrict to input-free innards (recall that we consider only input-free innards for simplicity of presentation), then, by propagation, we would also obtain that h = (g ∧ i) = 0. This would allow us to extend C to (C ∨ g ∨ h) = (w ∨ x ∨ g ∨ h). The following lemma follows by construction.
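The propagation step described above can be sketched as follows. This is a minimal illustration, not the tool's implementation: it assumes a netlist of 2-input AND gates in topological order, literals encoded as `(net, negated)` pairs, and a caller-supplied set of nets that count as innards; all names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Lit = Tuple[str, bool]  # (net, negated)


def lit_val(lit: Lit, env: Dict[str, Optional[bool]]) -> Optional[bool]:
    v = env.get(lit[0])
    return None if v is None else (v != lit[1])


def and3(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    if a is False or b is False:
        return False
    if a is True and b is True:
        return True
    return None


def extend_lemma(clause: List[Lit],
                 gates: List[Tuple[str, Tuple[Lit, Lit]]],
                 innards: Set[str]) -> List[Lit]:
    """Add every innard literal that is forced to false when the clause is false."""
    # Falsify each literal: a positive literal gets value 0, a negated one gets 1.
    env: Dict[str, Optional[bool]] = {name: negated for name, negated in clause}
    extended = list(clause)
    for out, (a, b) in gates:  # gates are assumed to be in topological order
        env[out] = and3(lit_val(a, env), lit_val(b, env))
        if out in innards and env[out] is not None:
            # (not C) implies out == env[out]; the literal that is false
            # under this value therefore implies C and can be added.
            extended.append((out, env[out]))
    return extended


# Gates of the running example: g = x AND y, h = g AND i.
GATES = [("g", (("x", False), ("y", False))),
         ("h", (("g", False), ("i", False)))]
C = [("w", False), ("x", False)]

input_free = extend_lemma(C, GATES, {"g"})       # only the input-free innard g
arbitrary = extend_lemma(C, GATES, {"g", "h"})   # h is allowed as well
```

With the input-free restriction the result corresponds to (w ∨ x ∨ g); when h is also allowed it corresponds to (w ∨ x ∨ g ∨ h), matching the two cases discussed in the text.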

Lemma 2 *Given lemma* C *over latches, the procedure* ExtendLemma *returns a lemma* C<sub>1</sub> *over latches and innards such that* Tr<sub>inn</sub> ⇒ (C<sub>1</sub> ⇔ C)*.*

Corollary 1 *Let* C *and* C<sub>1</sub> *be lemmas over latches and innards respectively, such that (i)* C *is inductive relative to some* G*, and (ii)* Tr<sub>inn</sub> ⇒ (C<sub>1</sub> ⇔ C)*. Then,* C<sub>1</sub> *is also inductive relative to* G*.*

We remark that extending lemmas with literals that imply it is closely related to *asymmetric literal addition* [15] in SAT. We also remark that the condition that the original lemma C is over latches is not essential, and ExtendLemma can be used to extend lemmas that already have innards in them. This may be potentially useful for additional IC3 extensions.

Input: Frame k, lemma C over latches and innards, s.t. C is inductive relative to F<sub>k</sub>

Output: (Inductively generalized) lemma C<sub>2</sub> ⊆ C over latches and innards, s.t. C<sub>2</sub> is inductive relative to F<sub>k</sub>

1. C ← SortLemma(C) // C = {c<sub>1</sub>, …, c<sub>n</sub>}
2. for i = 1, …, n do
3. &emsp;if c<sub>i</sub> *has already been removed from* C then // do nothing
4. &emsp;else if Tr<sub>inn</sub> ⇒ ((C \ c<sub>i</sub>) ⇔ C) then
5. &emsp;&emsp;C ← C \ c<sub>i</sub>
6. &emsp;else if $\widehat{Init} \Rightarrow C \setminus c_i$ *and* $F_k \land \widehat{Tr} \land (C \setminus c_i) \Rightarrow (C \setminus c_i)'$ then
7. &emsp;&emsp;C ← C \ c<sub>i</sub>
8. &emsp;&emsp;for j = i + 1, …, n do
9. &emsp;&emsp;&emsp;if c<sub>j</sub> *not used in the above proofs* then
10. &emsp;&emsp;&emsp;&emsp;C ← C \ c<sub>j</sub>
11. &emsp;&emsp;&emsp;else
12. &emsp;&emsp;&emsp;&emsp;break
13. return C

Fig. 2. Procedure InductivelyGeneralize: inductively generalizes lemmas over latches and innards.

#### *C. Inductively generalizing lemmas with innards*

Inductive generalization in traditional IC3 starts with a relatively inductive lemma C over latches (satisfying the conditions Init ⇒ C and F<sub>k</sub> ∧ Tr ∧ C ⇒ C′ with respect to a given frame F<sub>k</sub>), and attempts to remove literals from C as long as C remains relatively inductive. The same procedure can be immediately applied to a lemma over latches and innards, once $\widehat{Init}$ and $\widehat{Tr}$ are used instead of Init and Tr, respectively. However, we found that a naive application of inductive generalization gives poor results. In most cases, it simply removes the innards that were previously added by ExtendLemma, and therefore ends up with the original lemma over latches. Moreover, regular inductive generalization does not exploit possible dependencies between innards.

Fig. 2 shows a variant of inductive generalization that is better suited for generalizing lemmas over innards. The first step (line 1) consists of sorting the nets in the lemma, from the nets that we want to remove most to the nets that we want to remove least. In particular, we want to prioritize removal of latches, so as to obtain a lemma different from the one we started with. In our current implementation, we sort the nets by their *logic level*, so that latches have the lowest level and deeper nets in general have a higher level. This way deeper nets are considered "more interesting" and the algorithm attempts to remove shallower nets first. Other heuristics can be considered as well, e.g., sorting the nets by the *size of the supporting logic*, or even dynamic heuristics that measure the *activity* of a net in previously generalized lemmas.

The main loop (lines 3–12) corresponds to inductive generalization in regular IC3: essentially, we remove literals of C one by one, as long as C remains relatively inductive. We provide a detailed description of one iteration of the loop. Suppose that c<sub>i</sub> is the literal under consideration.

1) Note that *multiple* literals can be removed from C in a single iteration of the loop (this optimization is also present in regular IC3 inductive generalization), so at the start of the iteration (line 3), we check if c<sub>i</sub> has already been removed. If so, nothing needs to be done.

2) Lines 4–5 correspond to a special optimization that exploits dependencies between innards: in some cases, we can detect that c<sub>i</sub> can be removed without requiring a SAT query. For instance, c<sub>i</sub> can be removed when one of the following conditions holds:


For example, suppose that C = (a ∨ c ∨ d) and that the definition d = (b ∨ c) is part of Tr<sub>inn</sub>. Then, modulo Tr<sub>inn</sub>, C ⇔ (C \ c), i.e., (a ∨ c ∨ d) can be replaced by (a ∨ d). This closely corresponds to the *hidden literal elimination* technique in SAT [16], and can be viewed as the inverse of the argument used in ExtendLemma.

3) Line 6 checks whether c<sub>i</sub> can be removed using two SAT queries. One query checks the validity of $\widehat{Init} \Rightarrow (C \setminus c_i)$ by checking whether $\widehat{Init} \land \neg(C \setminus c_i)$ is unsatisfiable. The other query checks the validity of $F_k \land \widehat{Tr} \land (C \setminus c_i) \Rightarrow (C' \setminus c'_i)$ by checking whether $F_k \land \widehat{Tr} \land (C \setminus c_i) \land \neg(C' \setminus c'_i)$ is unsatisfiable. If both of these queries are unsatisfiable, c<sub>i</sub> can be removed.

4) IC3 has the following standard optimization based on considering which of the literals of (C \ c<sub>i</sub>) were potentially required for unsatisfiability: if c<sub>j</sub> ∈ C was not required for either check, then c<sub>j</sub> can be removed. This is typically implemented by passing the literals of ¬(C \ c<sub>i</sub>) via SAT *assumptions* and analyzing the set of *conflicting assumptions*, a mechanism supported by most modern SAT solvers, following MINISAT [17]. However, simply removing all non-required literals regardless of their order in C is more likely to remove the "more interesting" literals that we want to keep. So, our variant of this optimization (lines 8–12) only removes non-required literals with respect to the order. As an example, suppose that C = (c<sub>1</sub> ∨ c<sub>2</sub> ∨ c<sub>3</sub> ∨ c<sub>4</sub> ∨ c<sub>5</sub> ∨ c<sub>6</sub>) (in this order), and that only the literals c<sub>4</sub> and c<sub>6</sub> were potentially required for unsatisfiability queries involving C \ c<sub>1</sub>. In addition to removing c<sub>1</sub>, we also remove c<sub>2</sub> and c<sub>3</sub>, but not c<sub>5</sub>, and at the end of the iteration of the loop, C = (c<sub>4</sub> ∨ c<sub>5</sub> ∨ c<sub>6</sub>). Intuitively, this works better because leaving c<sub>5</sub> in the lemma increases the chances to remove c<sub>5</sub> and to leave c<sub>6</sub> (and not vice versa) on the following iterations of the loop.

Lastly, in most cases an assumption-based SAT solver applies assumptions in the order in which they are given; hence, the assumptions appearing earlier are more likely to remain (while later assumptions are more likely to be removed). Therefore, when performing the SAT queries, we *reverse* the order of assumption literals. For instance, when checking whether c<sub>1</sub> can be removed from C = (c<sub>1</sub> ∨ c<sub>2</sub> ∨ c<sub>3</sub> ∨ c<sub>4</sub> ∨ c<sub>5</sub> ∨ c<sub>6</sub>), the assumptions are ordered from c<sub>6</sub> to c<sub>2</sub> (and not from c<sub>2</sub> to c<sub>6</sub>).
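The order-respecting variant of the core-based optimization and the reversed assumption order can be sketched in a few lines. The helper names and the representation of the clause as an ordered Python list are assumptions of this illustration; the set `required` stands in for the conflicting assumptions that a SAT solver would report.

```python
def remove_with_ordered_core(clause, i, required):
    """Remove clause[i] and the literals right after it that were not required.

    Removal stops at the first later literal that was required, so literals
    further along the order are kept even if they were not required.
    """
    j = i + 1
    while j < len(clause) and clause[j] not in required:
        j += 1
    return clause[:i] + clause[j:]


def assumption_order(clause, i):
    """Assumption literals for testing removal of clause[i], in reverse order."""
    rest = clause[:i] + clause[i + 1:]
    return list(reversed(rest))


C = ["c1", "c2", "c3", "c4", "c5", "c6"]
after = remove_with_ordered_core(C, 0, required={"c4", "c6"})
order = assumption_order(C, 0)
```

On the example from the text, `after` keeps c4, c5 and c6 (c5 survives even though it was not required), and `order` lists the assumptions from c6 down to c2.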

Note that during the regular inductive generalization (i.e., when computing the original lemma over latches) it is beneficial to make multiple passes over the main loop (lines 3–12). However, when generalizing lemmas over innards, performing multiple passes has not proven to be useful, so we only perform a single pass.

Lemma 3 *Given a lemma* C<sub>1</sub> *over latches and innards, the* InductivelyGeneralize *procedure returns a lemma* C<sub>2</sub> *that is relatively inductive with respect to* F<sub>k</sub>*.*

Going back to our running example, suppose that C<sub>1</sub> = (w ∨ x ∨ g) is inductive relative to F<sub>k</sub> = ⊤. The procedure SortLemma is not likely to change the order of nets, as the latches already appear first. On the first iteration of the main loop, we attempt to remove w, but this fails as the SAT query $\top \land \widehat{Tr} \land (x \lor g) \land \neg x' \land \neg g'$ is satisfiable. On the second iteration, we attempt to remove x, and succeed, reducing C<sub>1</sub> to (w ∨ g). Finally, we attempt to remove g, which again fails. The final lemma returned by the algorithm is C<sub>2</sub> = (w ∨ g).
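The walk-through above can be replayed with a simplified version of the procedure. The sketch below is not the algorithm of Fig. 2 in full: it omits the SAT-free removal of lines 4–5 and the core-based removal of lines 8–12, and it replaces SAT queries by enumeration over the states of Example 5. All function names and the `LEVEL` table are assumptions of this illustration.

```python
from itertools import product

LATCHES = ("w", "x", "y", "z")
LEVEL = {"w": 0, "x": 0, "y": 0, "z": 0, "g": 1}  # logic level of each net


def with_innards(latches):
    state = dict(latches)
    state["g"] = state["x"] and state["y"]  # input-free innard g = x AND y
    return state


def all_states():
    for bits in product([False, True], repeat=len(LATCHES)):
        yield with_innards(dict(zip(LATCHES, bits)))


def successors(s):
    for i in (False, True):
        yield with_innards({"w": not s["w"], "x": s["w"],
                            "y": s["w"], "z": s["g"] and i})


def holds(clause, s):
    return any(s[name] for name in clause)  # positive literals only


def relatively_inductive(clause):
    """Def. 1 with an empty frame (F_k = true), checked by enumeration."""
    init_ok = all(holds(clause, s) for s in all_states()
                  if s["w"] and s["x"] and s["y"] and s["z"])
    step_ok = all(holds(clause, t) for s in all_states()
                  if holds(clause, s) for t in successors(s))
    return init_ok and step_ok


def inductively_generalize(clause):
    """Single pass: try to drop literals, shallowest nets first."""
    ordered = sorted(clause, key=lambda name: LEVEL[name])  # stable sort
    current = list(ordered)
    for lit in ordered:
        candidate = [l for l in current if l != lit]
        if candidate and relatively_inductive(candidate):
            current = candidate
    return current


result = inductively_generalize(["w", "x", "g"])
```

The run attempts w (fails), x (succeeds) and g (fails), so `result` corresponds to the lemma (w ∨ g).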

#### V. EXPERIMENTS

In this section, we present our experimental results. The techniques described in this paper are implemented in the IBM formal verification tool *Rulebase: Sixthsense Edition* [18]. In what follows, we denote by IC3 the default variant of IC3 used by the tool (see [6]), and by IC3-INN the variant with the additional learning of lemmas over innards. For these experiments, we restrict to input-free innards. Table I summarizes the experiments. The table contains the benchmark set (explained in detail later), the number of instances in this set, the time limit per instance, and the data on the performance of IC3 and IC3-INN. All the instances either are, or are expected to be, unsatisfiable. For both IC3 and IC3-INN, we list the number of solved instances, in parentheses the number of uniquely solved instances (that is, not solved by the other configuration), and the cumulative runtime in seconds. Next, we describe each benchmark set in detail.

#### *A. IBM-AOD-SEC*

This set of benchmarks comes from checking sequential equivalence between two designs in the Aspect Oriented Design flow at IBM. This SEC problem is very challenging, and is traditionally solved as described in [8], [9], using speculative reduction to reduce the problem into multiple simpler (but still hard) sub-problems. These are then solved using a dedicated engine configuration consisting of combinational rewriting, k-induction, localization, and, eventually, a proof-based technique like IC3. Historically, Interpolation (IMC) was used for the final step. Generally IMC works well, but unfortunately it is not stable: small changes in the design or in the solving configuration significantly affect verification times. While trying to find an alternative configuration, it was discovered that IC3 performs very poorly, while IC3-INN significantly outperforms all other approaches.

In total, there are 3 605 sub-problems. Each sub-problem contains 1–45 properties, 11–165 state elements, 126–2 290 inputs, and 754–15 924 gates. The (input-free) innards constitute on average 3% of the gates. For this experiment, we run both IC3 and IC3-INN with a time limit of 300 seconds per problem. Referring to Table I, regular IC3 performs very poorly: it can solve only 2 562 of the sub-problems and times out in the 1 043 remaining cases. On the other hand, IC3-INN performs extremely well: it can solve all of the problems, with the maximum runtime being only 36 seconds. Interestingly, IMC performs much better than IC3 on this set of problems and is also able to solve all problems (albeit about 13 times slower than IC3-INN). See the cactus plot in Fig. 3a for the detailed comparison between IC3, IC3-INN, and IMC.

Fig. 3. Performance of IC3 and IC3-INN on AOD SEC benchmarks.

A further comparison consists of comparing the number of lemmas in the safe inductive invariants discovered by IC3 and IC3-INN respectively. The scatter plot in Fig. 3b shows this data for the 2 562 instances solved by both configurations. We can see that IC3-INN discovers invariants that are significantly more compact, with the inductive invariants discovered by IC3-INN being on average 12× smaller than the invariants discovered by IC3. This partially explains the success of IC3-INN compared to IC3 on this set of benchmarks.

We also give data on the effectiveness of LearnAdditionalLemma, averaged across all 3 605 test cases. On average, the original lemma C (over latches) has 7 latches; ExtendLemma adds 10 innards; InductivelyGeneralize shrinks the lemma to 2 latches and 1 innard. The average logic level of innards is 7. Thus, LearnAdditionalLemma is able to produce significantly shorter lemmas using deep innards in the design.

Unfortunately, this benchmark set is proprietary and cannot be publicly released at this time.

#### *B. 6s119-SEC, 6s22-SEC*

Inspired by the success of IC3-INN on internal IBM benchmarks, we tried to manually create similar test cases starting from publicly available benchmarks. Specifically, we have taken several HWMCC designs, and created problems to check sequential equivalence between the original design and the retimed design [12]. We have further applied the SEC flow described above, consisting of breaking the main problem into multiple sub-problems using speculative reduction. It turns out that creating interesting benchmark sets in this way is non-trivial: in many cases the speculatively reduced problems turn out to be very easy, and in many other cases some of these speculatively reduced problems turn out to be satisfiable (in the real SEC flow this would trigger refinement and another speculative reduction). Nevertheless, we have created two benchmark sets 6s22-SEC and 6s119-SEC, available at https://github.com/agurfinkel/innard-benchmarks. The set 6s119-SEC consists of 364 rather easy problems, so that both IC3 and IC3-INN can solve all of them within 600 seconds, with IC3-INN being about 2.4× faster. The set 6s22-SEC consists of 310 problems, out of which IC3 can solve 262 problems and IC3-INN can solve 278 within 600 seconds. Please refer to Table I. Again, IC3-INN performs better than IC3, and is on average 1.3× faster. A more precise comparison is given in the scatter plots in Fig. 4. A detailed comparison against IMC is not included, as on both sets of problems IMC performs significantly worse than either IC3 or IC3-INN (for instance, within 600 seconds IMC cannot solve 64 out of 364 problems even for the easy set 6s119-SEC).

Fig. 4. Runtime of IC3 and IC3-INN on 6s119-SEC and 6s22-SEC.

#### *C. Other SEC benchmarks; AES-SEC*

As far as we know, there are no publicly available large SEC benchmark sets. HWMCC competitions do include several SEC benchmarks. However, in general we do not know which benchmarks come from SEC or what kind of application they represent. We believe it would be valuable to have a dedicated repository for SEC benchmarks.

The AES-SEC benchmark set was used in [13]. We have obtained this set from the authors of [13] in BTOR format, and translated it to AIGER. The AIGER benchmarks are available at https://github.com/agurfinkel/innard-benchmarks. In total, there are 16 problems, 12 of which turn out to be very easy for both IC3 and IC3-INN. Out of the 4 remaining problems, IC3 can solve 1, and IC3-INN can solve 3. Please see Table I for details.

Fig. 5. Runtime of IC3 and IC3-INN on HWMCC benchmarks.

#### *D. HWMCC benchmarks*

We have run extensive experiments on the single-property benchmarks from the HWMCC'11, HWMCC'17 and HWMCC'20 competitions (for the latter, we used the benchmarks in the AIGER format). In each case, we ran simple combinational reductions prior to running IC3, and used a time limit of 3 600 seconds. In Table I, we only report data for passing benchmarks that were solved either by IC3 or IC3-INN. In general, IC3-INN performs worse than IC3 both in terms of the number of properties solved and the total runtime. Detailed comparisons are presented as scatter plots in Fig. 5.

Table II presents data for 4 selected benchmarks. The benchmark *rast-p16* is very interesting: regular IC3 times out, yet IC3-INN solves the test case in just 2 seconds. Furthermore, this benchmark was solved by relatively few tools in the HWMCC'20 competition. Closely examining the lemmas learned by IC3-INN exposed the pattern from Example 4 of Section III. In other words, IC3-INN learns lemmas over innards, each equivalent to a very large number of lemmas over latches. This potentially explains the success of IC3-INN in this case. Another noteworthy benchmark is *zipversa_composecrc_prf-p10*, which IC3-INN solves in under 5 minutes, and which was solved by only one tool in the HWMCC'20 competition. The other two benchmarks exposed a certain inefficiency in our current implementation of IC3-INN. One can check that there are significantly more innards in the selected test cases (and in HWMCC test cases in general) as compared to IBM-AOD-SEC designs. The procedure InductivelyGeneralize starts taking a significant portion of the overall runtime, which negatively affects performance of IC3 when the lemmas over innards do not seem to help.

TABLE II SELECTED DESIGNS FROM HWMCC'20

#### VI. RELATED AND FUTURE WORK

The technique presented in this paper can be viewed as an extension of regular IC3 that simply learns an additional lemma during inductive generalization. As such, it is reasonably easy to integrate it in an existing IC3 implementation. The main technical point is replacing Init by $\widehat{Init}$ and Tr by $\widehat{Tr}$ in IC3's SAT queries. The key difference with other inductive generalization schemes (see for instance [3]) is that we are able to learn lemmas over both state variables and internal nets, which, in some cases, may exponentially reduce the size of the inductive invariant.

Backes and Riedel [7] also exploit internal nets in the design. However, the two approaches are very different: [7] uses input-free innards to generalize proof obligations (POBs), while we use arbitrary innards to generalize lemmas. Additionally, [7] uses only *input-free* innards (and, in fact, only the nets on the *boundary* between input-free and non inputfree parts of the netlist), while we use all internal nets. Even more importantly, in our work the decision of which innards to include in the lemma was based on the ability to inductively generalize this lemma and not whether the innards are "boundary" or not. Above notwithstanding, it is interesting to combine the two approaches, i.e., to allow both proofobligations and lemmas over internal nets. It is also interesting to more carefully integrate our approach with Quip [6]. Quip uses negations of lemmas as proof obligations, which would also introduce innards into POBs.

Another very interesting direction for further research is to extend the approach to learn lemmas over signals that are not present in the original netlist. Our framework allows such an extension: by including additional logic into the netlist (that is, creating additional innards), we would be able to learn lemmas over this new logic (even if this new logic is not in the cone-of-influence of the original problem!). This is closely related to implicit predicate abstraction of Tonetta et al. [19] that is used to lift propositional IC3 to SMT-based logics.

Finally, we believe that there is a lot of room to improve the current implementation. Currently, when there are many innards in the design, the procedure InductivelyGeneralize may require a large number of SAT queries, and, hence, may take a considerable portion of the overall runtime. Possibly, one can find better heuristics for which innards to consider (e.g., only innards with high logic level, or only *higher-priority* innards), or find more efficient procedures to perform inductive generalization (e.g., instead of the top-down approach that removes literals, one can consider a bottom-up approach that adds literals). In the worst case, if learning additional lemmas takes a considerable amount of time but does not seem useful, the technique can simply be turned off.

A further extension of our approach is to allow lemmas to be arbitrary formulas, not restricted to clauses in CNF. This is commonly done in SMT-based extensions of IC3 algorithms. For example, Sally [20] uses arbitrary SMT formulas as lemmas, and Spacer [21] uses clauses over a complex first-order signature. However, these techniques are difficult to port efficiently to the context of hardware model checking, since they rely on dynamic CNF conversion that is common in SMT solvers but not in SAT solvers.

### VII. CONCLUSION

Currently, IC3 is unquestionably the most effective technique for formal symbolic model checking. It has received a lot of research attention, and has been extended in a variety of ways including better inductive generalization, better lemma management, and search direction. However, one significant hidden limitation remains – IC3 is limited to learning inductive invariants in CNF over the latches (i.e., state variables) of the design. Therefore, IC3 cannot be effective for any design whose invariant has no concise CNF representation. No improvements in core IC3 parts can solve this problem.

In this paper, we propose to address this limitation by extending IC3 to learn lemmas not only over latches, but also over internal signals, which we call *innards*. We show that learning lemmas over innards is a natural generalization of *inductive generalization*. Instead of simply dropping literals to strengthen the lemma, we propose to replace literals by internal signals that are forced by them. We also propose several improvements to a naive strategy that lead to significantly improved performance.

Our work is motivated by a specialized set of Sequential Equivalence Checking (SEC) benchmarks at IBM. These benchmarks have been traditionally difficult for IC3, but not for Interpolation (IMC). However, the performance of interpolation was not stable – being affected by small changes in the verification flow. Our new implementation excels on these benchmarks and leads to an order of magnitude improvement in performance.

Unfortunately, similar performance gains do not manifest on the publicly available HWMCC benchmarks that are the de-facto metric for academic model checking research. We believe this shows a deficiency in the currently available benchmarks. Techniques that might be effective in industry might be missed by researchers since they do not perform well on these benchmarks. To remedy this, we identified some publicly available benchmarks, and created new benchmarks based on the SEC flow, that illustrate the advantage of our technique. We hope this can stimulate further research and improvements to IC3.

In the current work, we assume that the design is fixed, and use internal signals that are already available. We think that this opens an interesting direction by allowing IC3 to change the design by synthesizing new innards that are useful for a current verification run. This brings IC3 and interpolation much closer together, and also paves the way for bringing algorithms from hardware verification to software verification, and/or to the word level.

#### ACKNOWLEDGMENTS

The authors would like to thank Jason Baumgartner, Robert Kanzelman, Raj Kumar Gajavelly, Ziv Nevo, Hongce Zhang, Sharad Malik, Alan Mishchenko, and Baruch Sterin. This work was supported, in part, by an Individual Discovery Grant from the Natural Sciences and Engineering Research Council of Canada and an IBM Faculty Fellowship.

#### REFERENCES


# Single Clause Assumption without Activation Literals to Speed-up IC3

Nils Froleyks nils.froleyks@jku.at *Johannes Kepler University, Linz, Austria*

Armin Biere biere@cs.uni-freiburg.de *Albert–Ludwigs–University, Freiburg, Germany*

*Abstract*—We extend the well-established assumption-based interface of incremental SAT solvers to clauses, allowing the addition of a temporary clause that has the same lifespan as literal assumptions. Our approach is efficient and easy to implement in modern CDCL-based solvers. Compared to previous approaches, it does not come with any memory overhead and does not slow down the solver due to disabled activation literals, thus eliminating the need for algorithms like IC3 to restart the SAT solver. All clauses learned under literal and clause assumptions are safe to keep and not implicitly invalidated for containing an activation literal. These changes increase the quality of learned clauses, resulting in better generalization for IC3. We implement the extension in the SAT solver CaDiCaL and evaluate it with the IC3 implementation in the model checker ABC. Our experiments on the benchmarks from a recent hardware model checking competition show a speedup for the average SAT call and a reduction in number of calls per verification instance, resulting in a substantial improvement in model checking time.

#### INTRODUCTION

Modern SAT solving is based on Conflict-Driven Clause Learning (CDCL) [1]. Many applications require solving a sequence of related SAT problems incrementally [2], [3], making use of inprocessing techniques [4], [5], [6] that make modern SAT solvers so efficient. Among those applications is the symbolic model checking algorithm IC3. In contrast to other incremental SAT-based techniques, such as bounded model checking (BMC) [7], [8] and k-induction [9], [10], IC3 does not rely on unrolling the transition function. As a result the SAT queries that IC3 poses are significantly smaller and faster to solve. However, the number of queries that IC3 makes over the course of one model checking procedure is significantly higher. We illustrate the kind of queries that IC3 makes in the following example.

Fig. 1. Transition system

Consider the transition system of a three-bit ($b_2 b_1 b_0$) counter, encoding integers up to seven, in Fig. 1. Nondeterministically, the counter is incremented, remains unchanged, or is reset to zero after reaching five. Suppose we want to ensure that, starting at state zero, all states with values greater than five are unreachable. A typical query asks "is state six reachable from any other state?", expressed as $\mathrm{SAT?}[T \wedge (\neg b_2 \vee \neg b_1 \vee b_0) \wedge b_2' \wedge b_1' \wedge \neg b_0']$, where $T$ encodes the transition system for one step from $b_2 b_1 b_0$ to $b_2' b_1' b_0'$. It is unsatisfiable, telling us that state six is in fact unreachable. We can try to generalize this result to a set of states by considering a *cube* – an assignment to a subset of variables. The query $\mathrm{SAT?}[T \wedge (\neg b_1 \vee b_0) \wedge b_1' \wedge \neg b_0']$ is satisfiable because state two can be reached from state one, and $\mathrm{SAT?}[T \wedge (\neg b_2 \vee b_0) \wedge b_2' \wedge \neg b_0']$ is satisfiable due to the transition from state three to state four. However, the query $\mathrm{SAT?}[T \wedge (\neg b_2 \vee \neg b_1) \wedge b_2' \wedge b_1']$ is unsatisfiable, allowing us to conclude that all states in the cube $b_2 \wedge b_1$ are not reachable from outside the cube. We can use that insight to strengthen $T$ by adding $\neg b_2' \vee \neg b_1'$ to all future queries. This is in contrast to the clauses we previously added for only one query.
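The four queries above can be replayed by brute force over all pairs of counter values. The following is only an illustrative sketch: since Fig. 1 is not reproduced here, the transition relation is an assumption (in particular, values six and seven are assumed to have only self-loops), and the helper names `transition`, `bits`, and `sat` are hypothetical.

```python
from itertools import product

# Assumed transition relation: from value v the counter may stay at v,
# step to v + 1 while v < 5, or reset from 5 to 0.
# Values 6 and 7 are assumed to have only self-loops.
def transition(v, w):
    return w == v or (v < 5 and w == v + 1) or (v == 5 and w == 0)

def bits(v):
    """Return (b2, b1, b0) for a value in 0..7."""
    return (v >> 2) & 1, (v >> 1) & 1, v & 1

def sat(clause_now, cube_next):
    """Brute-force SAT?[T and clause_now(b) and cube_next(b')].

    Returns a witness transition (v, w), or None if the query is unsatisfiable.
    """
    for v, w in product(range(8), repeat=2):
        if transition(v, w) and clause_now(*bits(v)) and cube_next(*bits(w)):
            return (v, w)
    return None

# Is state six (110) reachable from any other state?
q_six = sat(lambda b2, b1, b0: not b2 or not b1 or b0,
            lambda b2, b1, b0: b2 and b1 and not b0)
# Cube b1 and not b0 (states two and six).
q_b1 = sat(lambda b2, b1, b0: not b1 or b0,
           lambda b2, b1, b0: b1 and not b0)
# Cube b2 and not b0 (states four and six).
q_b2 = sat(lambda b2, b1, b0: not b2 or b0,
           lambda b2, b1, b0: b2 and not b0)
# Cube b2 and b1 (states six and seven).
q_b2b1 = sat(lambda b2, b1, b0: not b2 or not b1,
             lambda b2, b1, b0: b2 and b1)
```

Under the assumed transition relation, `q_six` and `q_b2b1` are `None` (unsatisfiable), while `q_b1` and `q_b2` return the witness transitions from one to two and from three to four mentioned in the text.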

The popular assumption-based interface pioneered by MiniSat [2], [8] allows the user to specify a set of literals that are assumed to be true and picked by the solver as the first decisions. This allows us to add the assumption that a state is within a certain cube after the transition ($b_2' \wedge b_1'$). However, we still need to assume an additional clause encoding that the state is currently not within said cube ($\neg b_2 \vee \neg b_1$). The most common way to implement clause assumption is to simulate the desired behavior using activation literals [8], [11]. Let $C$ be a clause to add temporarily and let $a$, the activation literal, be a free variable, *i.e.,* it does not occur in the formula. By adding $C \vee a$ to the formula and assuming $\neg a$, we achieve the same as adding $C$ to the formula. After a solution is found, the unit clause $a$ is added, effectively removing $C$ from the formula.
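The activation-literal trick can be illustrated with a tiny brute-force solver. This is a sketch under stated assumptions: the `solve` function, the example formula, and the variable numbering are hypothetical and stand in for a real incremental SAT solver.

```python
from itertools import product

def solve(clauses, assumptions=()):
    """Brute-force SAT check; literals are non-zero ints (DIMACS style)."""
    variables = sorted({abs(l) for c in clauses for l in c}
                       | {abs(l) for l in assumptions})
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))

        def holds(lit):
            return model[abs(lit)] == (lit > 0)

        if (all(holds(l) for l in assumptions)
                and all(any(holds(l) for l in c) for c in clauses)):
            return True
    return False

# Example formula (x1) and (not x1 or x2): forces x1 = x2 = true.
formula = [[1], [-1, 2]]
C = [-2]   # temporary clause: not x2
a = 3      # fresh activation variable, not occurring in the formula

formula.append(C + [a])                     # add C or a permanently
with_c = solve(formula, assumptions=[-a])   # assuming not a activates C
formula.append([a])                         # unit clause a disables C for good
without_c = solve(formula)                  # C no longer restricts the search
```

Here `with_c` is `False` because the temporary clause contradicts the formula, and `without_c` is `True` once the unit clause has neutralized it. The variable `a` remains in the formula afterwards, which is the clutter discussed next.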

The problem with IC3 specifically is the large number of queries made over the course of a single verification procedure. After a few hundred calls, the activation literals clutter up the variable space and slow down the SAT solver's propagation. The common solution to this problem is to fully restart the SAT solver by replacing it with a fresh instance periodically, thus also deleting all learned clauses and heuristic scores. How to schedule these restarts in IC3 specifically has been the topic of a full journal paper [12]. Using the technique presented in this paper, restarts are not necessary at all. Additionally, learned clauses are safe to keep and will not contain an activation literal, which would make them useless for future calls.

Other approaches to clause assumption have been explored: The logic solver Satire [13] supports pseudo-Boolean and other constraints. It records the dependencies of learned constraints explicitly, thus allowing the deletion of arbitrary clauses. In the SMT community, an interface based on pushing and popping on the assertion stack is prevalent [14]. Since constraints are removed in order, it is possible to mark a point in the data structures that maintain learned knowledge and remove everything past it when a pop operation is executed. The first implementation of IC3 [15] used the SAT solver Zchaff [16]. It assigns an additional 32-bit integer to each clause. When learning a clause, the bits of all dependencies are combined. The user can delete a group of clauses with a certain bit. This approach mostly simulates the use of activation literals and comes with a significant memory overhead.

This paper presents an extension of the prevalent assumption mechanism to additionally allow the assumption of a single clause, called *constraint* in the following. The extension can be implemented by a simple modification to the decision mechanism in a CDCL-based SAT solver. We implemented it in under 100 lines of code in the state-of-the-art SAT solver CaDiCaL. To evaluate our implementation we modify the IC3 engine in the model checker ABC to use CaDiCaL and clause assumption. As a first result, the changes simplify SAT solver usage and eliminate the need for restarts as well as some bookkeeping for activation literals. An empirical evaluation on the 2019 hardware model checking competition [17] benchmark set shows that ABC spends less time outside of computing SAT queries, the number of queries per verification is reduced and the average SAT call is faster. Overall using clause assumptions yields a substantial speedup in verification time.

#### INCREMENTAL SAT AND IC3

An *incremental* SAT solver solves a series of related formulas efficiently. It communicates with the application integrating it through an *interface* such as IPASIR [11]. IPASIR is implemented by all solvers participating in the incremental library track of the SAT Competition since 2015. The popular solver MiniSat, along with all of its incremental descendants, implements something very similar. We describe the relevant subset:


A prominent application of incremental SAT solving is the symbolic model checking algorithm IC3 by Bradley [15]. Given a transition system and a property $P$, IC3 tries to prove that it is not possible to reach a state that violates the property. It maintains a sequence of *frames* $F_0, F_1, \ldots, F_k$, where each frame $F_i$ is a formula encoding an overapproximation of the set of states reachable in at most $i$ steps. The frames are refined by adding additional clauses until one of the frames contains all reachable states and none violates the property, or a counterexample is found. Each frame has its own SAT solver instance that is initialized with an encoding of the transition function and updated with the new frame clauses.

The solvers are used almost exclusively to answer queries for predecessors of the form $\mathrm{SAT?}[T \wedge F_i \wedge \neg s \wedge s']$, where $T$ is the transition function and $s$ is a cube. To refine the frames, a state $s$ in the last frame that violates the property is identified with the query $\mathrm{SAT?}[F_k \wedge \neg P]$. If no such state exists, a new frame is appended; otherwise IC3 tries to prove that the state is not actually reachable. The frames are queried for predecessors until an initial state is reached, thus producing a counterexample, or one of the frames returns unsat. In the latter case, failed can be used to generalize the unreachable state to a cube, the negation of which is added to the frame. IC3 is guaranteed to eventually terminate with two consecutive frames containing the same set of states.

#### ASSUMING CLAUSES

Our main contribution is an extension to incremental SAT solvers that allows the assumption of an additional clause, called *constraint*, which is only valid during the next satisfiability query. Two functions are added to the interface:


Our approach is similar to the idea of model elimination [18]. We modify the decision heuristic to restrict the search to assignments that satisfy the constraint. The modified decision procedure is outlined in Fig. 2. The function decide is called initially at decision level 0. Decisions assigned to the trail are propagated outside of the function to assign truth values. Whenever a conflict arises, the decision level decreases and the assignments are backtracked [1]. Every assumption has a fixed decision level. In the case where an assumption is already satisfied, a *pseudo* decision level is introduced. Otherwise, if an assumed literal is assigned to false at this point, the assignment is the result of propagating other assumptions together with original or learned clauses. Therefore, the formula is proven unsatisfiable under the current assumptions if line 4 is reached.

At the first decision level after all assumptions have been assigned, three cases need to be considered: if one of the literals in the constraint is already satisfied, the search is not restricted. Otherwise one of the literals is picked as a decision to satisfy the constraint. In line 13 a variable selection heuristic can be used to pick the most promising literals first, similarly to [19], [20]. In the case where all literals are assigned to false, they are implied by the assumptions, thus cannot be assigned differently. The formula is therefore declared unsatisfiable under the assumptions and the constraint. This might only happen after additional clauses have been learned.
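The case analysis described above can be sketched as a small function. This is not the listing of Fig. 2 and its line numbers do not correspond to those cited in the text; the function signature, the returned action tuples, and the helpers `value` and `pick_free` are assumptions made for illustration.

```python
def decide(level, assumptions, constraint, value, pick_free):
    """Return the next action at decision level `level` (0-based).

    value(lit)  -> True / False / None (unassigned) under the current trail.
    pick_free() -> some unassigned literal, or None if everything is assigned.
    """
    if level < len(assumptions):            # still handling assumptions
        lit = assumptions[level]
        if value(lit) is False:
            return ("unsat", lit)           # assumption refuted by propagation
        if value(lit) is True:
            return ("pseudo", None)         # already satisfied: empty level
        return ("decide", lit)
    if level == len(assumptions) and constraint:
        # First level after the assumptions: take care of the constraint.
        if not any(value(l) is True for l in constraint):
            for lit in constraint:          # given order; a heuristic could reorder
                if value(lit) is None:
                    return ("decide", lit)
            return ("unsat", None)          # every constraint literal is false
    lit = pick_free()                       # unrestricted regular decision
    return ("decide", lit) if lit is not None else ("sat", None)

# Toy trail: a dict from variable to truth value.
assign = {}

def value(lit):
    return None if abs(lit) not in assign else assign[abs(lit)] == (lit > 0)

def pick_free():
    return next((v for v in range(1, 5) if v not in assign), None)

assumptions, constraint = [1], [-2, -3]
first = decide(0, assumptions, constraint, value, pick_free)
assign[1] = True
second = decide(1, assumptions, constraint, value, pick_free)
assign.update({2: True, 3: True})           # e.g. forced by learned clauses
third = decide(1, assumptions, constraint, value, pick_free)
```

In this toy run, `first` decides the assumption, `second` decides the first unassigned constraint literal, and `third` reports unsatisfiability because both constraint literals have become false.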

This approach to handle assumptions was pioneered by MiniSat [2]. It has been improved upon by collectively propagating the assumptions, using trail saving between incremental calls [21] or factoring out assumptions [22]. These techniques can be combined with the presented constraint mechanism.

Fig. 2. Algorithm decide picks the next decision to propagate.

Modern SAT solvers not only report unsatisfiability as a result, but also allow the user to query whether a particular assumption failed, *i.e.,* was used to prove unsatisfiability. This concept, introduced as analyzeFinal by MiniSat [23], is essential for the efficiency of many applications. If an original or learned clause is inconsistent with the assumptions, the last assumption picked as a decision is already assigned to false. Using a simple breadth-first search, the reasons for this assignment can be traced back through the implication graph [1]. The assumptions at the leaves of the search tree are marked as failed. In line 16, a similar search is initialized with the negation of every literal in the constraint. Thus, all assumptions necessary to prove unsatisfiability of the constraint in conjunction with the formula are marked as failed.
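The breadth-first tracing of failed assumptions can be sketched as follows. The representation of the implication graph as a `reason` dictionary, the function name, and the example clauses are assumptions for illustration and do not mirror the internals of MiniSat or CaDiCaL.

```python
from collections import deque

def failed_assumptions(start, reason, assumptions):
    """Trace true literals in `start` back to the assumptions that imply them.

    reason[lit] lists the true literals that forced lit by propagation;
    literals without an entry were assigned as decisions (here: assumptions).
    """
    failed, seen, queue = set(), set(start), deque(start)
    while queue:
        lit = queue.popleft()
        antecedents = reason.get(lit)
        if antecedents is None:             # leaf of the search
            if lit in assumptions:
                failed.add(lit)
            continue
        for parent in antecedents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return failed

# Example: assumptions 1, 2, 3 with clauses (-1 or 4) and (-4 or -2 or 5).
# Propagation sets 4 (because of 1) and 5 (because of 4 and 2).
assumptions = {1, 2, 3}
reason = {4: [1], 5: [4, 2]}
constraint = [-4, -5]                       # all of its literals are now false

# Start the search from the negation of every constraint literal.
failed = failed_assumptions([-l for l in constraint], reason, assumptions)
```

In this example only assumptions 1 and 2 are marked as failed; assumption 3 played no role in falsifying the constraint.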

#### EXPERIMENTS

We implemented the constraint interface in CaDiCaL [24] version 1.3.1. To increase confidence in the correctness of the SAT solver and its new extension, we used the model-based tester [25] that is integrated with CaDiCaL. It generates random sequences of API calls including assumptions and constraints together with random configurations for the solver. The returned models and failed assumption sets are checked for correctness. We ran the tester on 8 cores for multiple days to validate 1.2 billion test runs.

To evaluate our approach, we integrated CaDiCaL into the bit-level model checker ABC<sup>1</sup> [26], replacing the integrated version of MiniSat [2]. There are two places where activation literals are used in ABC. The first is an alternative implementation of cube generalization that is not used in the default configuration. In fact, it seems to not work correctly in the default version of ABC<sup>1</sup>. The other usage of activation literals is in the function that implements the predecessor query $\mathrm{SAT?}[T \wedge F_i \wedge \neg s \wedge s']$. The transition function $T$ and the frame $F_i$ will only be extended with additional clauses; the cube $s$, however, changes at each query. The next-step cube $s'$ is in conjunction with the rest of the formula and therefore translates to a set of unit clauses that can be implemented with assumptions. To combat the slowdown due to unused activation literals cluttering up the variable space, ABC replaces the SAT solver with a new instance after adding 300 activation literals. Using the extended interface, the negated cube $\neg s$ can be added as a constraint, thus eliminating the restarts.

We tested five configurations: the original version of ABC (Og), disabled SAT solver restarts (Di), a version with CaDiCaL as backend using activation literals (Ca) and one also using CaDiCaL but the new constraint interface instead of activation literals (Co). As an additional result we present a slight modification to the last configuration that defers model reconstruction [6] in the SAT-case and failed literal collection in the UNSAT-case until a model or a failed literal is queried respectively (De). Using a heuristic to pick the literals from the constraint has not been successful. ABC uses a priority metric to order the literals of the cube s by default. Using this order for the constraint turned out to be superior to the heuristics available in CaDiCaL.

Our evaluation follows the principles laid out in SAT manifesto v1.0 [27]. The source code used for the evaluation and the generated log files are available on our website<sup>2</sup>. The experiments are run in parallel on 32 nodes of our cluster. Each node has access to two 8-core Intel Xeon E5-2620 v4 CPUs running at 2.10 GHz (turbo-mode disabled) and 128 GB main memory. We allocate 4 instances of ABC to every node. The time limit is set to 1 hour of wall-clock time, and memory is limited to 30 GB per instance. The memory limit is the only aspect that differs from the setup used in the hardware model checking competition. However, the maximum memory consumption was observed to be below 1.5 GB.

<sup>1</sup> commit f87c8b4
<sup>2</sup> http://fmv.jku.at/assumingclauses

TABLE I EXPERIMENTAL RESULTS.

Di: disabled restarts, Og: original version of ABC, Ca: CaDiCaL backend, Co: constraint interface used, De: deferred model reconstruction

The evaluation is based on the benchmark set used in the 2019 hardware model checking competition [17]. It contains 219 instances, 15 of which we removed because they were not solved by any tested configuration. We use PAR-2 scoring to compare the configurations. PAR-2 assigns the runtime in seconds, or twice the time limit (7200) if an instance was not solved. The other columns list additional measurements for the two configurations using CaDiCaL, one with activation literals (Ca) and the other using constraints instead (Co). The number of restarts is zero if constraints are used and therefore not shown. Besides that, we list the number of SAT calls (in thousands), along with the average time per call in milliseconds. Table I presents the measured data for instances where at least one configuration took more than two seconds, along with an average over all 204 instances.
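As a small illustration of PAR-2 scoring with the one-hour time limit used here, consider the sketch below. The function name and the runtimes are hypothetical and are not taken from Table I.

```python
def par2_score(runtime, time_limit=3600.0):
    """PAR-2 score of one instance; runtime is None when it was not solved."""
    if runtime is None or runtime > time_limit:
        return 2 * time_limit       # penalty: twice the time limit
    return runtime                  # solved: the runtime in seconds

# Hypothetical runtimes in seconds for three instances (None = unsolved).
runtimes = [12.5, None, 100.0]
scores = [par2_score(t) for t in runtimes]
mean_score = sum(scores) / len(scores)
```

With these made-up runtimes the unsolved instance contributes 7200, which dominates the average; this is why solving one additional instance can matter more than many small speedups.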

Comparing the first two columns, it is evident that if activation literals are used, solver restarts are necessary. It has been suggested [12] that because the queries posed by IC3 are small but numerous, IC3 implementations should prefer faster SAT solvers to more powerful ones. Comparing the original with the CaDiCaL version shows that while using MiniSat is faster on a number of instances, using CaDiCaL seems to be an advantage on the harder instances. In fact, using the newer SAT solver, one additional instance can be verified. Over all instances a speedup of 2.82 is observed.

With the version using CaDiCaL and activation literals as a baseline, we observe a speedup of 1.84 when switching to constraints. The time spent outside the SAT solver is reduced to below 20% by eliminating the actual SAT solver restarts and the repeated loading of the transition relation [28]. Beyond that, the average SAT call is 16% faster. This can partially be explained by the solver not being slowed down by activation literals. We conjecture that, more importantly, the "quality" of the learned clauses in the solver's database is higher. Since clauses are not deleted by restarts and none of the learned clauses are implicitly disabled for containing an activation literal, the solver can profit from shorter and more useful clauses. Measuring this quality, however, is outside the scope of this paper. An additional effect is that these clauses allow conflicts earlier in the search tree, resulting in fewer failed literals and thus allowing for better generalization in IC3. This can explain why 21% fewer calls are made.

The last two columns listing PAR-2 scores reflect small changes in the solver. Deferring the model reconstruction results in an additional speedup of 9%, increasing the total speedup compared to the original version to 5.64.

#### CONCLUSION

We present a simple extension to the commonly used incremental SAT solver interface IPASIR that simplifies solver usage and is easy to implement by modern SAT solvers. The extension gives an alternative to the techniques described in the journal paper [12] and partially implemented in ABC. Our experiments using the new technique with ABC show a substantial improvement in model checking time. Compared to the original IC3 engine, our final implementation is more than five times faster.

Handling more than one constraint can be achieved by using a complete model elimination search over the constraints. This would however increase the implementation effort. Additionally, inprocessing techniques cannot be applied, therefore model elimination might be less effective than using activation literals, if the number of temporary clauses is high. We leave this investigation to future work.

*Acknowledgements:* This work is supported by the Austrian Science Fund (FWF); projects W1255-N23 / S11408-N23 as well as the LIT AI Lab funded by the State of Upper Austria.

#### REFERENCES


# Logical Characterization of Coherent Uninterpreted Programs

Hari Govind V K *University of Waterloo* Sharon Shoham *Tel-Aviv University*

Arie Gurfinkel *University of Waterloo*

*Abstract*—An uninterpreted program (UP) is a program whose semantics is defined over the theory of uninterpreted functions. This is a common abstraction used in equivalence checking, compiler optimization, and program verification. While simple, the model is sufficiently powerful to encode counter automata, and, hence, undecidable. Recently, a class of UP programs, called coherent, has been proposed and shown to be decidable. We provide an alternative, logical characterization of this result. Specifically, we show that every coherent program is bisimilar to a finite state system. Moreover, an inductive invariant of a coherent program is representable by a formula whose terms are of depth at most 1. We also show that the original proof, via automata, only applies to programs over unary uninterpreted functions. While this work is purely theoretical, it suggests a novel abstraction that is complete for coherent programs but can be soundly used on *arbitrary* uninterpreted (and partially interpreted) programs.

### I. INTRODUCTION

The theory of Equality with Uninterpreted Functions (EUF) is an important fragment of First Order Logic, defined by a set of functions, equality axioms, and congruence axioms. Its satisfiability problem is decidable. It is a core theory of most SMT solvers, used as a glue (or abstraction) for more complex theories. A closely related notion is that of Uninterpreted Programs (UP), where all basic operations are defined by uninterpreted functions. Feasibility of a UP computation is characterized by satisfiability of its path condition in EUF. UPs provide a natural abstraction layer for reasoning about software. They have been used (sometimes without explicitly being named) in equivalence checking of pipelined microprocessors [1], and equivalence checking of C programs [17]. They also provide the foundations of Global Value Numbering (GVN) optimization in many modern compilers [6], [8], [12].

Unlike EUF, reachability in UP is undecidable. That is, in the *lingua franca* of SMT, the satisfiability of Constrained Horn Clauses over EUF is undecidable. Recently, Mathur et al. [9] have proposed a variant of UPs, called *coherent uninterpreted programs* (CUPs). The precise definition of coherence is rather technical (see Def. 3), but intuitively the program is restricted from depending on arbitrarily deep terms. The key result of [9] is to show that both reachability of CUPs and deciding whether a UP is coherent are decidable. This makes CUP an interesting infinite state abstraction with a *decidable* reachability problem.

Unfortunately, as shown by our counterexample in Fig. 4 (and described in Sec. VI), the key construction in [9] is incorrect. More precisely, the proofs of [9] hold only for CUPs restricted to unary functions. In this paper, we address this bug. We provide an alternative (in our view simpler) proof of decidability and extend the results from reachability to arbitrary model checking. The case of non-unary CUPs is much more complex than the unary case. This is not surprising, since similar complications arise in related results on Uniform Interpolation [4] and Cover [5] for EUF.

Our key result is a logical characterization of CUP. We show that the set of reachable states (i.e., the strongest inductive invariant) of a CUP is definable by an EUF formula, over program variables, with terms of depth at most 1. That is, the most complex term that can appear in the invariant is of the form $v \approx f(\vec{w})$, where $v$ and $\vec{w}$ are program variables, and $f$ is a function.

This characterization has several important consequences since the number of such bounded depth formulas is finite. Decidability of reachability, for example, follows trivially by enumerating all possible candidate inductive invariants. More importantly from a practical perspective, it leads to an efficient analysis of *arbitrary* UPs. Take a UP $P$, and check whether it has a safe inductive invariant of bounded terms. Since the number of terms is finite, this can be done by implicit predicate abstraction [3]. If no invariant is found, and the counterexample is not feasible, then $P$ is not a CUP. At this point, the process either terminates, or another verification round is done with predicates over deeper terms. Crucially, this does not require knowing whether $P$ is a CUP a priori – a problem that itself is shown in [9] to be at least PSPACE.

We extend the results further and show that CUPs are bisimilar to a finite state system, showing, in particular, that arbitrary model checking for CUP (not just reachability) is decidable.

Our proofs are structured around a series of abstractions, illustrated in a commuting diagram in Fig. 1. Our key abstraction is the base abstraction $\alpha_b$. It forgets terms deeper than depth 1, while maintaining all their consequences (by using additional fresh variables). We show that $\alpha_b$ is sound and complete (i.e., preserves all properties) for CUPs (while sound, but not complete, for UPs). It is combined with a cover abstraction $\alpha_C$ that we borrow from [5]. The cover abstraction ensures that reachable states are always expressible over program variables. It serves the purpose of existential quantifier elimination, which is not available for EUF. Finally, a renaming abstraction $\alpha_r$ is a technical tool to bound the occurrences of constants in abstract reachable states.

Fig. 1: Sequence of abstractions used in our proofs.

The rest of the paper is structured as follows. We review the necessary background on EUF in Sec. II. We introduce our formalization of UPs and CUPs in Sec. III. Sec. IV presents bisimulation inducing abstractions for UP. Sec. V presents our base abstraction and shows that it induces a bisimulation for CUPs. Sec. VI develops a logical characterization for CUPs, presents our decidability results, and shows that a finite state abstraction of CUPs is computable. We conclude the paper in Sec. VII with a summary of results and a discussion of open challenges and future work.

#### II. BACKGROUND

We assume that the reader is familiar with the basics of First Order Logic (FOL), and the theory of Equality and Uninterpreted Functions (EUF). We use $\Sigma = (\mathcal{C}, \mathcal{F}, \{\approx, \not\approx\})$ to denote a FOL signature with constants $\mathcal{C}$, functions $\mathcal{F}$, and predicates $\{\approx, \not\approx\}$, representing equality and disequality, respectively. A term is a constant or a (well-formed) application of a function to terms. A literal is either $x \approx y$ or $x \not\approx y$, where $x$ and $y$ are terms. A formula is a Boolean combination of literals. We assume that all formulas are quantifier free unless stated otherwise. We further assume that all formulas are in Negation Normal Form (NNF), so negation is defined as a shorthand: $\neg(x \approx y) \triangleq x \not\approx y$, and $\neg(x \not\approx y) \triangleq x \approx y$. Throughout the paper, we use $\bowtie$ to indicate a predicate in $\{\approx, \not\approx\}$. For example, $\{x \bowtie y\}$ means $\{x \approx y, x \not\approx y\}$. We write $\bot$ for false, and $\top$ for true. We do not differentiate between sets of literals $\Gamma$ and their conjunction ($\bigwedge \Gamma$). We write $\mathit{depth}(t)$ for the maximal depth of function applications in a term $t$. We write $\mathcal{T}(\varphi)$, $\mathcal{C}(\varphi)$, and $\mathcal{F}(\varphi)$ for the set of all terms, constants, and functions, in $\varphi$, respectively, where $\varphi$ is either a formula or a collection of formulas. Finally, we write $t[x]$ to mean that the term $t$ contains $x$ as a subterm.

For a formula $\varphi$, we write $\Gamma \models \varphi$ if $\Gamma$ *entails* $\varphi$, that is, every model of $\Gamma$ is also a model of $\varphi$. For any literal $\ell$, we write $\Gamma \vdash \ell$, pronounced $\ell$ is *derived* from $\Gamma$, if $\ell$ is derivable from $\Gamma$ by the usual EUF proof system $P_{EUF}$.<sup>1</sup> By refutational completeness of $P_{EUF}$, $\Gamma$ is unsatisfiable iff $\Gamma \vdash \bot$.

Given two EUF formulas φ<sup>1</sup> and φ<sup>2</sup> and a set of constants V ⊆ C, we say that the formulas are V -equivalent, denoted φ<sup>1</sup> ≡<sup>V</sup> φ2, if, for all quantifer free EUF formulas ψ such that C(ψ) ⊆ V , (φ<sup>1</sup> ∧ ψ) |= ⊥ if and only if (φ<sup>2</sup> ∧ ψ) |= ⊥.

Example 1 Let φ<sup>1</sup> = {x<sup>1</sup> ≈ f(a0, x0), y<sup>1</sup> ≈ f(b0, y0), x<sup>0</sup> ≈ y0}, φ<sup>2</sup> = {x<sup>1</sup> ≈ f(a0, w), y<sup>1</sup> ≈ f(b0, w)}, φ<sup>3</sup> = {x<sup>1</sup> ≈ f(a0, x0), y<sup>1</sup> ≈ f(b0, y0)}, and V = {x1, y1, a0, b0}. Then, φ<sup>1</sup> ≡<sup>V</sup> φ<sup>2</sup> but φ<sup>1</sup> ̸≡<sup>V</sup> φ3. ✷
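The non-equivalence in Example 1 can be checked with a witness formula ψ over V. The sketch below is illustrative only: it uses a naive congruence closure (not the proof system of the technical report), the tuple encoding of terms is an assumption, and `is_satisfiable` is a hypothetical helper. It decides satisfiability of a conjunction of literals, so it can confirm one distinguishing ψ, but it cannot establish V-equivalence, which quantifies over all ψ.

```python
def subterms(t):
    """All subterms of t; constants are strings, applications are tuples."""
    out = {t}
    if not isinstance(t, str):
        for a in t[1:]:
            out |= subterms(a)
    return out

def is_satisfiable(literals):
    """Naive congruence closure over literals (op, lhs, rhs), op in {"eq", "neq"}."""
    terms = set()
    for _, lhs, rhs in literals:
        terms |= subterms(lhs) | subterms(rhs)
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t

    def union(s, t):
        parent[find(s)] = find(t)

    for op, lhs, rhs in literals:
        if op == "eq":
            union(lhs, rhs)

    # Saturate with the congruence rule: equal arguments give equal applications.
    apps = [t for t in terms if not isinstance(t, str)]
    changed = True
    while changed:
        changed = False
        for s in apps:
            for t in apps:
                if (find(s) != find(t) and s[0] == t[0] and len(s) == len(t)
                        and all(find(a) == find(b) for a, b in zip(s[1:], t[1:]))):
                    union(s, t)
                    changed = True

    # Satisfiable iff no disequality relates two terms of the same class.
    return all(find(l) != find(r) for op, l, r in literals if op == "neq")

phi1 = [("eq", "x1", ("f", "a0", "x0")), ("eq", "y1", ("f", "b0", "y0")), ("eq", "x0", "y0")]
phi2 = [("eq", "x1", ("f", "a0", "w")), ("eq", "y1", ("f", "b0", "w"))]
phi3 = [("eq", "x1", ("f", "a0", "x0")), ("eq", "y1", ("f", "b0", "y0"))]
psi = [("eq", "a0", "b0"), ("neq", "x1", "y1")]  # C(psi) is a subset of V
```

With ψ = (a0 ≈ b0 ∧ x1 ̸≈ y1), both φ1 ∧ ψ and φ2 ∧ ψ are unsatisfiable, while φ3 ∧ ψ is satisfiable, which separates φ3 from φ1.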

$$\begin{aligned} \langle stmt \rangle ::={} & \mathbf{skip} \mid \langle var \rangle := \langle var \rangle \mid \langle var \rangle := f(\langle var \rangle, \ldots, \langle var \rangle) \mid \\ & \mathbf{assume}\,(\langle cond \rangle) \mid \langle stmt \rangle \,;\, \langle stmt \rangle \mid \\ & \mathbf{if}\ (\langle cond \rangle)\ \mathbf{then}\ \langle stmt \rangle\ \mathbf{else}\ \langle stmt \rangle \mid \\ & \mathbf{while}\ (\langle cond \rangle)\ \langle stmt \rangle \\ \langle cond \rangle ::={} & \langle var \rangle = \langle var \rangle \mid \langle var \rangle \neq \langle var \rangle \\ \langle var \rangle ::={} & \mathtt{x} \mid \mathtt{y} \mid \cdots \end{aligned}$$

Fig. 2: Syntax of the programming language UPL.

While EUF does not admit quantifier elimination, it does admit elimination of constants while preserving quantifier free consequences. Formally, a *cover* [2], [4], [5] of an EUF formula φ w.r.t. a set of constants V is an EUF formula ψ such that C(ψ) ⊆ C(φ) \ V and φ ≡<sup>C(φ)\V</sup> ψ. By [5], such ψ exists and is unique up to equivalence; we denote it by CV · φ.

#### III. UNINTERPRETED PROGRAMS

An *uninterpreted program (UP)* is a program in the *uninterpreted programming language (UPL)*. The *syntax* of UPL is shown in Fig. 2. Let V denote a fixed set of program variables. We use lower case letters in a special font: x, y, etc. to denote individual variables in V. We write ⃗y for a list of program variables. Function symbols are taken from a fixed set F. As in [9], w.l.o.g., UPL does not allow for Boolean combination of conditionals and relational symbols.

The small step symbolic operational semantics of UPL is defined with respect to a FOL signature Σ = (C, F, {≈, ̸≈}) by the rules shown in Fig. 3. A program *configuration* is a triple ⟨s, q, pc⟩, where s, called a statement, is a UP being executed, q : V → C is a *state* mapping program variables to constants in C, and pc, called the *path condition*, is an EUF formula over Σ. We use C(q) ≜ {c | ∃v · q(v) = c} to denote the set of all constants that represent current variable assignments in q. With abuse of notation, we use C(q) and q interchangeably. We write ≡<sup>q</sup> to mean ≡<sup>C(q)</sup>.

For a state q, we write q[x ↦ x′] for a state q′ that is identical to q, except that it maps x to x′. We write ⟨e, q⟩ ⇓ v to denote that v is the value of the expression e in state q, i.e., the result of substituting each program variable x in e with q(x), and replacing functions and predicates with their FOL counterparts. The value of e is an FOL term or an FOL formula over Σ. For example, ⟨x = y, [x ↦ x, y ↦ y]⟩ ⇓ x ≈ y.

Given two configurations c and c′, we write c → c′ if c reduces to c′ using one of the rules in Fig. 3. Note that there is no rule for skip – the program terminates once it gets into a configuration ⟨skip, q, pc⟩.

Let C<sup>0</sup> = {v<sup>0</sup> | v ∈ V} ⊆ C be a set of initial constants. In the initial state q<sup>0</sup> of a program, every variable is mapped to the corresponding initial constant, i.e., q0(v) = v0.

The operational semantics induces, for an UP P, a transition system S<sup>P</sup> = ⟨C, c0, R⟩, where C is the set of configurations, c<sup>0</sup> ≜ ⟨P, q0, ⊤⟩ is the initial configuration, and R ≜ {(c, c′) | c → c′}. A configuration c of P is *reachable* if c is reachable from c<sup>0</sup> in S<sup>P</sup>. We denote the set of all reachable configurations in S<sup>P</sup> using Reach(S<sup>P</sup>). The set of all statements in the semantics of P, including the intermediate statements, are called *locations* of P, and are denoted by L(P). We often use P and S<sup>P</sup> interchangeably.

<sup>1</sup>Presented in our companion technical report [7].

$$\langle \mathtt{skip}; s, q, pc \rangle \to \langle s, q, pc \rangle$$

$$\frac{\langle s\_1, q, pc \rangle \to \langle s'\_1, q', pc' \rangle}{\langle s\_1; s\_2, q, pc \rangle \to \langle s'\_1; s\_2, q', pc' \rangle}$$

$$\frac{\langle c, q \rangle \Downarrow v \qquad (pc \wedge v) \not\models \bot}{\langle \mathtt{assume}(c), q, pc \rangle \to \langle \mathtt{skip}, q, pc \wedge v \rangle}$$

$$\frac{\langle e, q \rangle \Downarrow v \qquad x' \in \mathcal{C} \text{ is fresh in } pc}{\langle \mathtt{x} := e, q, pc \rangle \to \langle \mathtt{skip}, q[\mathtt{x} \mapsto x'], pc \wedge x' \approx v \rangle}$$

$$\langle \mathtt{if} \ (c) \ \mathtt{then} \ s\_1 \ \mathtt{else} \ s\_2, q, pc \rangle \to \langle \mathtt{assume}(c); s\_1, q, pc \rangle$$

$$\langle \mathtt{if} \ (c) \ \mathtt{then} \ s\_1 \ \mathtt{else} \ s\_2, q, pc \rangle \to \langle \mathtt{assume}(\neg c); s\_2, q, pc \rangle$$

$$\langle \mathtt{while} \ (c) \ s, q, pc \rangle \to \langle \mathtt{if} \ (c) \ \mathtt{then} \ (s; \mathtt{while} \ (c) \ s) \ \mathtt{else} \ \mathtt{skip}, q, pc \rangle$$

Fig. 3: Small step symbolic operational semantics of UPL, where ¬c denotes x ̸= y when c is x = y, and x = y when c is x ̸= y.

Our semantics of UPL differs in some respects from the one in [9]. First, we follow a more traditional small-step operational semantics presentation, by providing semantics rules and the corresponding transition system. However, this does not change the semantics conceptually. More importantly, we ensure that the path condition remains satisfiable in all reachable configurations (by only allowing an assume statement to execute when it results in a satisfiable path condition). We believe this is a more natural choice that is also consistent with what is typically used in other symbolic semantics. UP reachability under our semantics coincides with the definition of [9].
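The small-step semantics, including the satisfiability guard on assume, can be sketched as a successor function. Everything below is an illustrative assumption rather than the authors' implementation: statements are tuples such as `("assign", "x", ("f", "y"))` or `("assume", ("eq", "x", "y"))`, path conditions are tuples of literals, the satisfiability check is passed in as a hypothetical callback `is_sat`, and fresh constants are named `v_k` with k derived from the length of the path condition (which assumes initial constants are named `v_0`).

```python
def evaluate(e, q):
    """<e, q> ⇓ v: replace program variables by the constants that q assigns."""
    if isinstance(e, str):
        return q[e]
    return (e[0],) + tuple(q[y] for y in e[1:])

def negate(c):
    op, x, y = c
    return ("neq" if op == "eq" else "eq", x, y)

def step(config, is_sat):
    """Return all successors of config = (stmt, q, pc)."""
    s, q, pc = config
    kind = s[0]
    if kind == "skip":                      # no rule for skip: terminal
        return []
    if kind == "assume":                    # only fires if pc stays satisfiable
        new_pc = pc + (evaluate(s[1], q),)
        return [(("skip",), q, new_pc)] if is_sat(new_pc) else []
    if kind == "assign":
        _, x, e = s
        fresh = f"{x}_{len(pc) + 1}"        # assumed naming scheme for fresh constants
        return [(("skip",), {**q, x: fresh}, pc + (("eq", fresh, evaluate(e, q)),))]
    if kind == "seq":
        _, s1, s2 = s
        if s1 == ("skip",):
            return [(s2, q, pc)]
        return [(("seq", s1p, s2), qp, pcp)
                for s1p, qp, pcp in step((s1, q, pc), is_sat)]
    if kind == "if":
        _, c, s1, s2 = s
        return [(("seq", ("assume", c), s1), q, pc),
                (("seq", ("assume", negate(c)), s2), q, pc)]
    if kind == "while":                     # unfold the loop once
        _, c, body = s
        return [(("if", c, ("seq", body, s), ("skip",)), q, pc)]
    raise ValueError(f"unknown statement kind: {kind}")
```

Passing the satisfiability check as a parameter keeps the sketch independent of any particular EUF decision procedure; any sound and complete check for conjunctions of literals could be plugged in.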

Definition 1 (UP Reachability) Given an UP P, determine whether there exists a state q and a path condition pc s.t. the configuration ⟨skip, q, pc⟩ is reachable in P. ✷

A certificate for unreachability of location s is an inductive assertion map η (or an inductive invariant) s.t. η(s) = ⊥.

Definition 2 (Inductive Assertion Map) Let Σ<sup>0</sup> ≜ (C0, F, {≈, ̸≈}) be the restriction of Σ to C0. An *inductive assertion map* of an UP P is a map η : L(P) → EUF(Σ0) s.t. (a) η(P) = ⊤, and (b) if ⟨s, q0, η(s)⟩ → ⟨s′, q′, pc′⟩, then pc′ |= (η(s′)[v<sup>0</sup> ↦ q′(v) | v ∈ V]). ✷

In [9], a special sub-class of UPs has been introduced with a decidable reachability problem.

Definition 3 (Coherent Uninterpreted Program [9]) An UP P is *coherent* (CUP) if all of the reachable configurations of P satisfy the following two properties:

Memoizing: for any configuration ⟨x := f(⃗y), q, pc⟩, if there is a term t ∈ T(pc) s.t. pc |= t ≈ f(q(⃗y)), then there is v ∈ V s.t. pc |= q(v) ≈ t.

Early assume: for any configuration ⟨assume(x = y), q, pc⟩, if there is a term t ∈ T(pc) s.t. pc |= t ≈ s where s is a superterm of either q(x) or q(y), then there is v ∈ V s.t. pc |= q(v) ≈ t. ✷

Fig. 4: An example CUP program and its inductive assertions.

Intuitively, memoization ensures that if a term is recomputed, then it is already stored in a program variable; early assume ensures that whenever an equality between variables is assumed, any of their superterms that was ever computed is still stored in a program variable. Note that unlike the original definition of CUP in [9], we do not require the notion of an *execution*. The path condition accumulates the history of the execution in a configuration, which is sufficient.

Example 2 An example of a CUP is shown in Fig. 4. Some reachable states in the first iteration of the loop are shown below, where line numbers are used as locations, and pc<sup>i</sup> stands for the path condition at line i:

$$\begin{aligned} &\langle 2, q\_0[\mathbf{x} \mapsto x\_1, \mathbf{y} \mapsto y\_1], x\_1 \approx t\_0 \wedge y\_1 \approx t\_0 \rangle \\ &\langle 6, q\_0[\mathbf{x} \mapsto x\_2, \mathbf{y} \mapsto y\_2, \mathbf{c} \mapsto c\_1], pc\_2 \wedge c\_0 \not\approx d\_0 \wedge x\_2 \approx n(x\_1) \wedge y\_2 \approx n(y\_1) \wedge c\_1 \approx n(c\_0) \rangle \\ &\langle 9, q\_0[\mathbf{x} \mapsto x\_3, \mathbf{y} \mapsto y\_3, \mathbf{c} \mapsto c\_1], pc\_6 \wedge c\_1 \approx d\_0 \wedge x\_3 \approx f(a\_0, x\_2) \wedge y\_3 \approx f(b\_0, y\_2) \rangle \end{aligned}$$

The program is coherent because (a) no term is recomputed; (b) for the assume at line 10, the only superterms of a<sup>0</sup> and b<sup>0</sup> are f(a0, xn) and f(b0, yn), and they are stored in x and y, respectively; and (c) for the assume (c<sup>n</sup> = d0) introduced by the exit condition of the while loop, no superterms of cn, d<sup>0</sup> are ever computed. The program does not reduce to skip (i.e., it does not reach a final configuration). Its inductive assertion map is shown in Fig. 4 (right). ✷

Note that UPs are closely related, but are not equivalent, to the Herbrand programs of [12]. While Herbrand programs use the syntax of UPL, they are interpreted over a fixed universe of Herbrand terms. In particular, in Herbrand programs f(x) ≈ g(x) is always false (since f(x) and g(x) have different top-level functions), while in UP, it is satisfiable.

#### IV. ABSTRACTION AND BISIMULATION FOR UP

In this section, we review abstractions for transition systems. We then define two abstractions for UP, cover and renaming, and show that they induce bisimulations. That is, for UP, these abstractions preserve all properties. Finally, we show a simple logical characterization result for UP to set the stage for our main results in the following sections.

Definition 4 Given a transition system S = (C, c0, R) and a (possibly partial) abstraction function ♯ : C → C, the induced *abstract transition system* is ♯(S) = (C, c<sub>0</sub><sup>♯</sup>, R<sup>♯</sup>), where

$$\begin{aligned} c\_0^\sharp &\triangleq \sharp(c\_0) \\ \mathcal{R}^\sharp &\triangleq \{ (c\_\sharp, c\_\sharp') \mid \exists c, c'. \; c \to c' \land \; c\_\sharp = \sharp(c) \land c\_\sharp' = \sharp(c') \} \end{aligned}$$

We write c →<sup>♯</sup> c′ when (c, c′) ∈ R<sup>♯</sup>. Note that ♯ must be defined for c0. ✷
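For a finite system, Def. 4 can be computed directly. The sketch below is an illustration under stated assumptions: the transition relation is given as an explicit finite set of pairs (the systems in this paper are in general infinite), the abstraction is a Python dict so that partiality is modeled by missing keys, and the name `abstract_system` is hypothetical.

```python
def abstract_system(c0, transitions, alpha):
    """Induced abstract transition system of a finite concrete system.

    transitions: set of (c, c') pairs of the concrete system.
    alpha: dict from concrete to abstract configurations (possibly partial).
    Returns the abstract initial configuration and abstract transition set.
    """
    if c0 not in alpha:
        raise ValueError("the abstraction must be defined for the initial configuration")
    abstract_transitions = {
        (alpha[c], alpha[c_next])
        for (c, c_next) in transitions
        if c in alpha and c_next in alpha  # skip pairs where alpha is undefined
    }
    return alpha[c0], abstract_transitions
```

As a toy usage, abstracting the counter 0 → 1 → 2 → 3 by parity yields the two abstract transitions even → odd and odd → even.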

Throughout the paper, we construct several abstract transition systems. All transition systems considered are *attentive*. Intuitively, this means that their transitions do not distinguish between configurations that have q-equivalent path conditions. We say that two configurations c<sup>1</sup> = ⟨s, q, pc1⟩ and c<sup>2</sup> = ⟨s, q, pc2⟩ are equivalent, denoted c<sup>1</sup> ≡ c<sup>2</sup>, if pc<sup>1</sup> ≡<sup>q</sup> pc2.

Definition 5 (Attentive TS) A transition system S = (C, c0, R) is *attentive* if for any two configurations c1, c<sup>2</sup> ∈ C s.t. c<sup>1</sup> ≡ c2, if there exists c′<sub>1</sub> ∈ C s.t. (c1, c′<sub>1</sub>) ∈ R, then there exists c′<sub>2</sub> ∈ C s.t. (c2, c′<sub>2</sub>) ∈ R and c′<sub>1</sub> ≡ c′<sub>2</sub>, and vice versa. ✷

Weak, respectively strong, preservation of properties between the abstract and the concrete transition systems is ensured by the notion of *simulation*, respectively *bisimulation*.

Definition 6 ([11]) Let S = (C, c0, R) and ♯(S) = (C, c<sub>0</sub><sup>♯</sup>, R<sup>♯</sup>) be transition systems. A relation ρ ⊆ C × C is a *simulation* from S to ♯(S), if for every (c, c<sub>♯</sub>) ∈ ρ:

• if c → c′ then there exists c′<sub>♯</sub> such that c<sub>♯</sub> →<sup>♯</sup> c′<sub>♯</sub> and (c′, c′<sub>♯</sub>) ∈ ρ.

ρ ⊆ C × C is a *bisimulation* from S to ♯(S) if ρ is a simulation from S to ♯(S) and ρ<sup>−1</sup> ≜ {(c<sub>♯</sub>, c) | (c, c<sub>♯</sub>) ∈ ρ} is a simulation from ♯(S) to S. We say that ♯(S) *simulates*, respectively *is bisimilar to*, S if there exists a simulation, respectively, a bisimulation, ρ from S to ♯(S) such that (c0, c<sub>0</sub><sup>♯</sup>) ∈ ρ. ✷
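On finite systems, the two conditions of Def. 6 can be checked by enumeration. The following sketch assumes explicit finite transition sets and a relation given as a set of pairs; the function names are hypothetical, and the check is quadratic and meant only to make the definition concrete.

```python
def is_simulation(rho, trans_a, trans_b):
    """Check that rho (pairs (c, d)) is a simulation from system A to system B."""
    for (c, d) in rho:
        for (c1, c2) in trans_a:
            if c1 != c:
                continue
            # Every step c -> c2 must be matched by some step d -> d2 with (c2, d2) in rho.
            if not any(d1 == d and (c2, d2) in rho for (d1, d2) in trans_b):
                return False
    return True

def is_bisimulation(rho, trans_a, trans_b):
    """rho is a bisimulation iff rho and its inverse are both simulations."""
    inverse = {(d, c) for (c, d) in rho}
    return is_simulation(rho, trans_a, trans_b) and is_simulation(inverse, trans_b, trans_a)
```

For example, a four-state cycle is bisimilar to its parity abstraction, whereas the finite chain 0 → 1 → 2 → 3 is only simulated by it, because the abstract system can always continue while state 3 cannot.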

We say that a bisimulation ρ ⊆ C × C is *finite* if its range, {ρ(c) | c ∈ C}, is finite. A finite bisimulation relates a (possibly infinite) transition system with a finite one.

Next, we define two abstractions for UP programs and show that they result in bisimilar abstract transition systems. The first abstraction eliminates all constants that are not assigned to program variables from the path condition, using the cover operation. The second abstraction renames the constants assigned to program variables back to the initial constants C0. Both abstractions together ensure that all reachable configurations in the abstract transition system are defined over Σ<sup>0</sup> (i.e., the only constants that appear in states, as well as in path conditions, are from C0). There may still be infinitely many such configurations since the depth of terms may be unbounded. We show that whenever the obtained abstract transition system has finitely many reachable configurations, the concrete one has an inductive assertion map that characterizes the set of reachable configurations.

Definition 7 (Cover abstraction) The cover abstraction function α<sup>C</sup> : C → C is defined by

$$\alpha\_{\mathbb{C}}(\langle s, q, pc \rangle) \triangleq \langle s, q, \mathbb{C}(\mathcal{C} \setminus \mathcal{C}(q)) \cdot pc \rangle \qquad \qquad \square$$

Since pc ≡<sup>q</sup> C(C \ C(q))· pc, the cover abstraction also results in a bisimilar abstract transition system.

Theorem 1 *For any attentive transition system* S = (C, c0, R)*, the relation* ρ = {(c, αC(c)) | c ∈ Reach(S)} *is a bisimulation from* S *to* αC(S)*.* ✷

To introduce the renaming abstraction, we need some notation. Given a quantifier free formula φ and constants a, b ∈ C(φ) such that a ̸= b, let φ[a ↣ b] denote φ[b ↦ x][a ↦ b], where x is a constant not in C(φ). For example, if φ = (a ≈ c ∧ b ≈ d), then φ[a ↣ b] = (b ≈ c ∧ x ≈ d).

Given a path condition pc and a state q, let r0(pc, q) denote the formula obtained by renaming all constants in C(q) using their initial values. Formally, r0(pc, q) = pc[q(v) ↣ v0] for all v ∈ V such that q(v) ̸= v0.
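The renaming φ[a ↣ b] and the function r0 can be sketched as follows. This is an illustration under assumptions not fixed by the text: formulas are lists of literals `(op, lhs, rhs)`, terms are strings or tuples, fresh constants are named `fresh0`, `fresh1`, and so on, and `r0` applies the renamings one variable at a time. The names `substitute`, `rename`, and `r0` with an explicit `q0` argument are hypothetical.

```python
import itertools

def substitute(term, old, new):
    """Replace every occurrence of constant `old` in `term` by `new`."""
    if isinstance(term, str):
        return new if term == old else term
    return (term[0],) + tuple(substitute(a, old, new) for a in term[1:])

def constants(term):
    """Set of constants occurring in a term."""
    if isinstance(term, str):
        return {term}
    return set().union(*(constants(a) for a in term[1:]))

def rename(phi, a, b):
    """phi[a ↣ b]: move b to a fresh constant, then rename a to b."""
    used = set()
    for _, lhs, rhs in phi:
        used |= constants(lhs) | constants(rhs)
    fresh = next(f"fresh{i}" for i in itertools.count() if f"fresh{i}" not in used)
    moved = [(op, substitute(l, b, fresh), substitute(r, b, fresh)) for op, l, r in phi]
    return [(op, substitute(l, a, b), substitute(r, a, b)) for op, l, r in moved]

def r0(pc, q, q0):
    """Rename the constant of every variable back to its initial constant."""
    for v in q:
        if q[v] != q0[v]:
            pc = rename(pc, q[v], q0[v])
    return pc
```

On the example from the text, renaming a to b in (a ≈ c ∧ b ≈ d) produces (b ≈ c ∧ fresh0 ≈ d), where `fresh0` plays the role of the fresh constant x.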

Definition 8 (Renaming abstraction) The renaming abstraction function α<sup>r</sup> : C → C is defined by

$$
\alpha\_r(\langle s, q, pc \rangle) \triangleq \langle s, q\_0, r\_0(pc, q) \rangle \qquad \qquad \square
$$

Theorem 2 *For any attentive transition system* S = (C, c0, R)*, the relation* ρ = {(c, αr(c)) | c ∈ Reach(S)} *is a bisimulation from* S *to* α<sup>r</sup> (S)*.* ✷

Finally, we denote by α<sup>C</sup>,<sup>r</sup> the composition of the renaming and cover abstractions: α<sup>C</sup>,<sup>r</sup> ≜ α<sup>C</sup> ◦ α<sup>r</sup> (i.e., α<sup>C</sup>,<sup>r</sup> (c) = α<sup>r</sup> (αC(c))). Since the composition of bisimulation relations is also a bisimulation, α<sup>C</sup>,<sup>r</sup> (S) is bisimilar to S.

Theorem 3 (Logical Characterization of UP) *If* α<sup>C</sup>,<sup>r</sup> *induces a finite bisimulation on an UP* P*, then there exists an inductive assertion map* η *for* P *that characterizes the reachable configurations of* P*.* ✷

PROOF Define η(s) ≜ ⋁ {pc | ⟨s, q, pc⟩ ∈ Reach(α<sup>C</sup>,<sup>r</sup> (P))}. Then, η is such an inductive assertion map. ■

Intuitively, Thm. 3 says that an inductive invariant of a UP, whenever it exists, can be described using EUF formulas over program variables. That is, any extra variables that are added to the path condition during program execution can be abstracted away (specifically, using the cover abstraction). There are, of course, infinitely many such invariants since the depth of terms is not bounded (only constants occurring in them). In the sequel, we systematically construct a similar result for CUP.

#### V. BISIMULATION OF CUP

The first step in extending Thm. 3 to CUP is to design an abstraction function that bounds the depth of terms that appear in any reachable (abstract) state. It is easy to design such a function while maintaining soundness – simply forget literals that have terms that are too deep. However, we want to maintain precision as well. That is, we want the abstract transition system to be bisimilar to the concrete one. Just like cover abstraction, the base abstraction function also eliminates all constants that are not assigned to program variables. Unlike cover abstraction, the base abstraction does not maintain C(q) equivalence of the path conditions, but, rather, forgets most literals that cannot be expressed over program variables.

In this section, we focus on the definition of the base abstraction and prove that it induces bisimulation for CUP. This result is used in Sec. VI to logically characterize CUPs.

Intuitively, the base abstraction "truncates" the congruence graph induced by a path condition in nodes that have no representative in the set of constants assigned to the program variables (V in the following definition), and assigns to the truncated nodes fresh constants (from W in the following definition).

Congruence closure procedures for EUF use a *congruence graph* to concisely represent the deductive closure of a set of EUF literals [15], [16]. Here, we use a logical characterization of a congruence graph, called a V *-basis*. Let Γ be a set of EUF literals. A triple ⟨W, β, δ⟩ is a V -basis of Γ relative to a set of constants V , written ⟨W, β, δ⟩ ∈ base(Γ, V ), iff (a) W is a set of fresh constants not in C(Γ), and β and δ are conjunctions of EUF literals; (b) (∃W · β ∧ δ) ≡ Γ; (c) β ≜ β<sup>≈</sup> ∪ β̸≈ ∪ β<sup>F</sup> and δ ≜ δ<sup>≈</sup> ∪ δ̸≈ ∪ δ<sup>F</sup> , where

$$\begin{aligned} \beta\_{\approx} & \subseteq \{ u \approx v \mid u, v \in V \} & \beta\_{\not\approx} & \subseteq \{ u \not\approx v \mid u, v \in V \} \\ \beta\_{\mathcal{F}} & \subseteq \{ v \approx f(\vec{w}) \mid v \in V, \vec{w} \subseteq V \cup W, \vec{w} \cap V \neq \emptyset \} \\ \delta\_{\approx} & \subseteq \{ w \approx u \mid w \in V \cup W, u \notin V \cup W \} \\ \delta\_{\not\approx} & \subseteq \{ u \not\approx w \mid u \in W, w \in W \cup V \} \\ \delta\_{\mathcal{F}} & \subseteq \{ v \approx f(\vec{w}) \mid v, \vec{w} \subseteq V \cup W, v \in V \Rightarrow \vec{w} \subseteq W \} \end{aligned}$$

(d) β ∧ δ ⊬ v ≈ w for any v ∈ V , w ∈ W; and (e) β ∧ δ ⊬ w<sup>1</sup> ≈ w<sup>2</sup> for any w1, w<sup>2</sup> ∈ W s.t. w<sup>1</sup> ̸= w2.

Note that we represent both equalities and disequalities in the V-basis as common in implementations (but not in the theoretical presentations) of the congruence closure algorithm. Intuitively, V are constants in C(Γ) that represent equivalence classes in Γ, and W are constants added to represent equivalence classes that do not have a representative in V. A V-basis of any satisfiable set Γ is unique up to renaming of constants in W and ordering of equalities between constants in V.

Example 3 Let Γ = {x ≈ f(a, v1), y ≈ f(b, v2), v<sup>1</sup> ≈ v2} and V = {a, b, x, y}. A V-basis of Γ is ⟨W, β, δ⟩, where W = {w}, β = {x ≈ f(a, w), y ≈ f(b, w)}, δ = {w ≈ v1, w ≈ v2}. Renaming w to w′ is a different V-basis: ⟨W′, β′, δ′⟩ ∈ base(Γ, V), where W′ = {w′}, β′ = β[w ↦ w′] and δ′ = δ[w ↦ w′].

As another example, consider Γ = {x ≈ f(a, p), x ≈ f(a, n(p)), y ≈ f(b, p), y ≈ f(c, n(p))} and V = {x, y, a, b, c}. A V-basis of Γ is ⟨W, β<sup>2</sup>, δ<sup>2</sup>⟩, where W = {w0, w1}, δ<sup>2</sup> = {w<sup>0</sup> ≈ p, w<sup>1</sup> ≈ n(w0)}, and

$$\beta\_2 = \{ x \approx f(a, w\_0),\ x \approx f(a, w\_1),\ y \approx f(b, w\_0),\ y \approx f(c, w\_1) \} \qquad \qquad \square$$

While a basis maintains all consequences of Γ (since (∃W · β ∧ δ) ≡ Γ), the V-base abstraction of Γ, defined next, is weaker. It preserves consequences of β only:

Definition 9 (V-base abstraction) The V-base abstraction α<sup>V</sup> for a set of constants V is a function between sets of literals s.t. for any sets of literals Γ and Γ′:


The second requirement of Def. 9 ensures that two formulas that have the same V-consequences have the same V-base abstraction. For example, for a set of constants V = {u, v}, the formulas φ<sup>1</sup> = {v ≈ f(u, x)} and φ<sup>2</sup> = {v ≈ f(u, y)} have the same V-base abstraction: v ≈ f(u, w). Note that at this point, we only require that α<sup>V</sup> is well defined (for example, it does not have to be computable).

We now extend V-base abstraction to program configurations, calling it simply *base abstraction*, since the set of preserved constants is determined by the configuration:

Definition 10 (Base abstraction) The base abstraction α<sup>b</sup> : C → C is defined for configurations ⟨s, q, pc⟩ ∈ C, where pc is a *conjunction* of literals: αb(⟨s, q, pc⟩) ≜ ⟨s, q, α<sub>C(q)</sub>(pc)⟩. ✷

Namely, the base abstraction α<sub>C(q)</sub> applied to the path condition is determined by the state q in the configuration. We often write αq(φ) as a shorthand for α<sub>C(q)</sub>(φ).

We are now in position to state the main result of this section. Given a CUP P, the abstract transition system αb(S<sup>P</sup>) = (C, c<sub>0</sub><sup>α<sub>b</sub></sup>, R<sup>α<sub>b</sub></sup>) is bisimilar to the concrete transition system S<sup>P</sup> = (C, c0, R). Note that at this point, we do not claim that αb(S<sup>P</sup>) is finite, or that it is computable. We focus only on the fact that the literals that are forgotten by the base abstraction do not matter for any future transitions. The key technical step is summarized in the following theorem:

Theorem 4 *Let* ⟨s, q, pc⟩ *be a reachable configuration of a CUP* P*. Then,*

$$\begin{array}{ll}(1) & \langle s, q, pc \rangle \rightarrow \langle s', q', pc \wedge pc' \rangle \text{ iff } \langle s, q, \alpha\_q(pc) \rangle \rightarrow \langle s', q', \alpha\_q(pc) \wedge pc' \rangle, \text{ and} \\ (2) & \alpha\_{q'}(pc \wedge pc') = \alpha\_{q'}(\alpha\_q(pc) \wedge pc'). \end{array}$$

The proof of Thm. 4 is not complicated, but it is tedious and technical. It depends on many basic properties of EUF. We summarize the key results that we require in the following lemmas. The proofs of the lemmas are provided in our companion technical report [7].

We begin by defining a *purifier* – a set of constants sufficient to represent a set of EUF literals with terms of depth one.

Definition 11 (Purifier) We say that a set of constants V is a *purifier* of a constant a in a set of literals Γ, if a ∈ V and for every term t ∈ T(Γ) s.t. Γ ⊢ t ≈ s[a], ∃v ∈ V s.t. Γ ⊢ v ≈ t. ✷

For example, let Γ = {c ≈ f(a), d ≈ f(b), d ̸≈ e}. Then, V = {a, b, c} is a purifier for a, but not a purifier for b, even though b ∈ V: the superterm f(b) of b is equal only to d, and d ∉ V.

In all the following lemmas, Γ, φ1, φ<sup>2</sup> are sets of literals; V is a set of constants; a, b ∈ C(Γ); u, v, x, y ∈ V; V is a purifier for {x, y} in Γ, in φ1, and in φ2; β = α<sup>V</sup>(Γ); and α<sup>V</sup>(φ1) = α<sup>V</sup>(φ2).

Lemma 1 says that anything newly derivable from Γ and a new equality a ≈ b is derivable using superterms of a and b:

Lemma 1 *Let* t<sup>1</sup> *and* t<sup>2</sup> *be two terms in* T(Σ) *s.t.* Γ ̸⊢ (t<sup>1</sup> ≈ t2)*. Then,* (Γ ∧ a ≈ b) ⊢ (t<sup>1</sup> ≈ t2)*, for some constants* a *and* b *in* C(Γ)*, iff there are two superterms,* s1[a] *and* s2[b]*, of* a *and* b*, respectively, s.t. (i)* Γ ⊢ (t<sup>1</sup> ≈ s1[a])*, (ii)* Γ ⊢ (t<sup>2</sup> ≈ s2[b])*, and (iii)* (Γ ∧ a ≈ b) ⊢ (s1[a] ≈ s2[b])*.*

Lemma 2 and Lemma 3 say that all consequences of Γ that are relevant to V are present in β = α<sup>V</sup> (Γ) as well.

Lemma 2 (Γ ∧ x ≈ y ⊢ u ≈ v) ⇐⇒ (β ∧ x ≈ y ⊢ u ≈ v)*.*

Lemma 3 (Γ ∧ x ≈ y ⊢ u ̸≈ v) ⇐⇒ (β ∧ x ≈ y ⊢ u ̸≈ v)*.*

Lemma 4 says that β = α<sup>V</sup>(Γ) can be described using terms of depth one using constants in V.

Lemma 4 V *is a purifier for* x ∈ V *in* β*.*

Lemma 5 says that α<sup>V</sup> is idempotent.

Lemma 5 α<sup>V</sup> (Γ) = α<sup>V</sup> (α<sup>V</sup> (Γ))*.*

Lemma 6 and Lemma 7 say that α<sup>V</sup> preserves addition of new literals and dropping of constants.

Lemma 6 α<sup>V</sup>(φ<sup>1</sup> ∧ x ≈ y) = α<sup>V</sup>(φ<sup>2</sup> ∧ x ≈ y)*.*

Lemma 7 *If* U ⊆ V*, then*

$$(\alpha\_V(\varphi\_1) = \alpha\_V(\varphi\_2)) \Rightarrow (\alpha\_U(\varphi\_1) = \alpha\_U(\varphi\_2))$$

Lemma 8 extends the preservation results to disequalities. Here, V is a set of constants and x, y ∈ V; V is not required to be a purifier (as it was in the previous lemmas).

Lemma 8 α<sup>V</sup>(φ<sup>1</sup> ∧ x ̸≈ y) = α<sup>V</sup>(φ<sup>2</sup> ∧ x ̸≈ y)*.*

Lemma 9 extends the preservation results to equalities involving a fresh constant x′ s.t. x′ ̸∈ C(φ1) ∪ C(φ2). Let y⃗ ⊆ V, V′ = V ∪ {x′}, and let f(y⃗) be a term s.t. there does not exist a term t ∈ T(φ1) ∪ T(φ2) with φ<sup>1</sup> ⊢ t ≈ f(y⃗) or φ<sup>2</sup> ⊢ t ≈ f(y⃗).

Lemma 9

$$
\alpha\_{V'}(\varphi\_1 \wedge x' \approx y) = \alpha\_{V'}(\varphi\_2 \wedge x' \approx y) \tag{1}
$$

$$\alpha\_{V'}(\varphi\_1 \wedge x' \approx f(\vec{y})) = \alpha\_{V'}(\varphi\_2 \wedge x' \approx f(\vec{y})) \tag{2}$$

We are now ready to present the proof of Thm. 4:

PROOF (THEOREM 4) In the proof, we use x = q(x), and y = q(y). For part (1), we only show the proof for s = assume(x ▷◁ y) since the other cases are trivial.

The only-if direction follows since αq(pc) is weaker than pc. For the if direction, pc ̸⊢ ⊥ since it is part of a reachable configuration. Then, there are two cases:

• case s = assume(x = y). Assume (pc ∧ x ≈ y) |= ⊥. Then, (pc ∧ x ≈ y) ⊢ t<sup>1</sup> ≈ t<sup>2</sup> and pc ⊢ t<sup>1</sup> ̸≈ t<sup>2</sup> for some t1, t<sup>2</sup> ∈ T(pc). By Lemma 1, in any new equality (t<sup>1</sup> ≈ t2) that is implied by pc ∧ (x ≈ y) (but not by pc), t<sup>1</sup> and t<sup>2</sup> are equivalent (in pc) to superterms of x or y. By the early assume property of CUP, C(q) purifies {x, y} in pc. Therefore, every superterm of x or y is equivalent (in pc) to some constant in C(q). Thus, (pc ∧ x ≈ y) ⊢ u ≈ v and (pc ∧ x ≈ y) ⊢ u ̸≈ v for some u, v ∈ C(q). By Lemma 2, (αq(pc) ∧ x ≈ y) ⊢ u ≈ v. By Lemma 3, (αq(pc) ∧ x ≈ y) ⊢ u ̸≈ v. Thus, (αq(pc) ∧ x ≈ y) |= ⊥.

• case s = assume(x ̸= y). (pc ∧ x ̸≈ y) |= ⊥ if and only if pc ⊢ x ≈ y. Since x, y ∈ C(q), αq(pc) ⊢ x ≈ y. ■

For part (2), we only show the cases for assume and assignment statements, the other cases are trivial.


Corollary 1 *For a CUP* P*, the relation* ρ ≜ {(c, αb(c)) | c ∈ Reach(S<sup>P</sup> )} *is a bisimulation from* S<sup>P</sup> *to* αb(S<sup>P</sup> )*.* ✷

Note that for an arbitrary UP, α<sup>b</sup> induces a simulation (since α<sup>b</sup> only weakens path conditions).

By construction, for any configuration in an abstract system constructed using αb, the path condition will be at most depth-1. In Sec. VI, we use this property to build a logical characterization of CUP and show that reachability of CUP programs is decidable.

#### VI. LOGICAL CHARACTERIZATION OF CUP

In this section, we show that for any CUP program P, all reachable configurations of P can be characterized using formulas in EUF, whose size is bounded by the number of program variables in P.

Theorem 5 (Logical Characterization of CUP) *For any CUP* P*, there exists an inductive assertion map* η*, ranging over EUF formulas of depth at most 1, that characterizes the reachable configurations of* P*.* ✷

The first step in the proof is to compose the renaming abstraction (Def. 8) with the base abstraction (Def. 10). We denote the composition with αb,<sup>r</sup> , i.e., αb,<sup>r</sup> ≜ α<sup>b</sup> ◦ α<sup>r</sup> . Cor. 1 and Thm. 2 ensure that αb,<sup>r</sup> is sound and complete for CUP. We split the rest of the proof into two cases: CUPs restricted to unary functions, called 1-CUP, followed by arbitrary CUPs.

PROOF (THM. 5, 1-CUP) Let Σ<sup>1</sup> be a signature containing function symbols of arity at most 1, Σ<sup>1</sup> ≜ (C, F<sup>1</sup>, {≈, ̸≈}). Let Γ be a set of literals in Σ<sup>1</sup> and V be a set of constants. By the definition of V-base abstraction (Def. 9), α<sup>V</sup>(Γ) = β<sup>≈</sup> ∧ β̸≈ ∧ β<sup>F</sup>. β<sup>≈</sup> and β̸≈ are over constants in V. β<sup>F</sup> contains two types of literals: β<sup>F</sup><sup>V</sup> and β<sup>F</sup><sup>W</sup>. β<sup>F</sup><sup>V</sup> are depth-1 literals over constants in V. β<sup>F</sup><sup>W</sup> are literals of the form v ≈ f(w⃗) where v ∈ V and w⃗ is a list of constants, at least one of which is in V and at least one of which is not: w⃗ ∩ V ̸= ∅ and w⃗ ̸⊆ V. Since Γ can only have unary functions, β<sup>F</sup><sup>W</sup> = ∅. Therefore, all literals in α<sup>V</sup>(Γ) are of depth at most 1 and only contain constants from V. Hence, there are only finitely many configurations in αb,<sup>r</sup> (S<sup>P</sup>). Therefore,

$$\eta(s) \triangleq \bigvee \{ pc \mid \langle s, q\_0, pc \rangle \in \operatorname{Reach}(\alpha\_{b,r}(\mathcal{S}\_P)) \}$$

is an inductive assertion map, ranging over formulas of depth at most 1, that characterizes the reachable configurations of P. Moreover, the size of each disjunct in η(s) is polynomial in the number of program variables and functions in P. ■

An interesting consequence of the above proof is that, for 1-CUPs, α<sup>b</sup> is efficiently computable (since β<sup>F</sup><sup>W</sup> = ∅). Thus, the transition system αb,<sup>r</sup> (S<sup>P</sup>) is finite, and can be constructed on-the-fly. Hence, reachability of 1-CUP is in PSPACE.

PROOF (THM. 5, GENERAL CASE) In general, CUP programs can contain unary and non-unary functions. Therefore, the V-base abstraction (Def. 9) may introduce fresh constants. We use the cover abstraction (Def. 7) to eliminate these fresh constants. By Thm. 1, αC(αb,<sup>r</sup> (S<sup>P</sup>)) is bisimilar to αb,<sup>r</sup> (S<sup>P</sup>). Notice that all the fresh constants introduced by the V-base abstraction are arguments to function applications. Therefore, all consequences of eliminating the fresh constants are Horn clauses of the form ⋀<sub>i</sub> (x<sup>i</sup> ≈ yi) ⇒ x ≈ y, where xi, y<sup>i</sup>, x, y ∈ C0. Since V-basis is of depth at most 1, cover of the V-basis is also of depth at most 1. Since there are only finitely many formulas of depth at most 1 over C0, αC(αb,<sup>r</sup> (S<sup>P</sup>)) has only finitely many configurations. Hence,

$$\eta(s) \triangleq \bigvee \{ pc \mid \langle s, q\_0, pc \rangle \in \operatorname{Reach}(\alpha\_{\mathbb{C}}(\alpha\_{b,r}(\mathcal{S}\_P))) \}$$

is an inductive assertion map that characterizes the reachable configurations of P and ranges over depth-1 formulas. ■

Consider the CUP shown in Fig. 4. At line 9, the αb,<sup>r</sup> abstraction produces the following abstract pc: x<sup>0</sup> ≈ f(a0, w)∧y<sup>0</sup> ≈ f(b0, w) ∧ c<sup>0</sup> ≈ d0. Using cover to eliminate the constant w gives us Cw · pc = (a<sup>0</sup> ≈ b<sup>0</sup> ⇒ x<sup>0</sup> ≈ y0) ∧ c<sup>0</sup> ≈ d0, which is exactly the invariant assertion mapping η(9) at line 9.
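The cover in this example can be unpacked step by step. The derivation below is a sketch of the reasoning for this particular path condition, not the general cover construction of [5]: once a<sub>0</sub> ≈ b<sub>0</sub> is assumed, the two applications of f share the argument w and become congruent, which is the only way w links the remaining constants.

$$\begin{aligned} pc &= x\_0 \approx f(a\_0, w) \wedge y\_0 \approx f(b\_0, w) \wedge c\_0 \approx d\_0 \\ pc \wedge a\_0 \approx b\_0 &\vdash f(a\_0, w) \approx f(b\_0, w) && \text{(congruence)} \\ pc \wedge a\_0 \approx b\_0 &\vdash x\_0 \approx y\_0 && \text{(transitivity)} \\ \mathbb{C}w \cdot pc &= (a\_0 \approx b\_0 \Rightarrow x\_0 \approx y\_0) \wedge c\_0 \approx d\_0 \end{aligned}$$

Written as a disjunction, the implication a<sub>0</sub> ≈ b<sub>0</sub> ⇒ x<sub>0</sub> ≈ y<sub>0</sub> is a<sub>0</sub> ̸≈ b<sub>0</sub> ∨ x<sub>0</sub> ≈ y<sub>0</sub>.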

We have seen that all CUP programs have an inductive assertion map that characterizes their reachable configurations and ranges over a finite set of formulas. Therefore,

### Corollary 2 *CUP reachability is decidable.* ✷

### *A. Relationship to [9]*

In [9], Cor. 2 is proven by constructing a deterministic finite automaton that accepts all *feasible* coherent executions.<sup>2</sup> However, the construction fails for the executions of the CUP in Fig. 4: the execution that reaches a terminal configuration is infeasible, but it is (wrongfully) accepted by the automaton. Intuitively, the reason is that the automaton is deterministic and its states are not sufficiently expressive. The states of the automaton keep track of equalities between program variables (which correspond to β<sup>≈</sup> in our abstraction), disequalities between them (β̸≈ in our case), and partial function interpretations (β<sup>F</sup>). However, the partial function interpretations are restricted to β<sup>F</sup><sup>V</sup>, i.e., do not allow auxiliary constants that are not assigned to program variables. Thus, they are unable to keep track of x<sup>0</sup> ≈ f(a0, w) ∧ y<sup>0</sup> ≈ f(b0, w) ∧ c<sup>0</sup> ≈ d<sup>0</sup> in line 9, which is essential for showing infeasibility of the execution. Eliminating the auxiliary constants, as we do in the cover abstraction, does not remedy the situation since it introduces a disjunction (a<sup>0</sup> ̸≈ b<sup>0</sup> ∧ c<sup>0</sup> ≈ d0) ∨ (x<sup>0</sup> ≈ y<sup>0</sup> ∧ c<sup>0</sup> ≈ d0), which the deterministic automaton does not capture.

#### *B. Computing a Finite Abstraction*

We have shown that CUP programs are bisimilar to finite state systems. However, all our proofs depend on α<sub>b</sub>, which was not assumed to be computable. In this section, we show how to implement α<sub>b</sub>, and, thereby, show how to compute a finite state system that is bisimilar to a CUP program. Note that our prior results are independent of this section.

The main difficulty is in naming the fresh constants, which we always refer to as W, that are introduced by the base abstraction. Since we require that the base abstraction is canonical, the naming has to be unique. Furthermore, we have to show that the number of such W constants is bounded. We solve both of these problems by proposing a deterministic naming scheme. The scheme is determined by a normalization function n<sub>V</sub> that replaces all the fresh constants in a V-basis with canonical constants.

<sup>2</sup> In our setting, feasible coherent executions correspond to paths in the transition system of any CUP.

Let β be a V-basis. We denote the auxiliary constants in β (C(β) \ V) by W = {w<sub>0</sub>, w<sub>1</sub>, . . .}, and by '?' some unused constant that we call a *hole*. Recall that constants from W may only appear in literals of the form v ≈ f(w⃗). We define the set of W-templates as the set of all terms f(a⃗), where each element in a⃗ is either a hole or a constant in W. A term t *matches* a template f(a⃗) if t = f(b⃗), and a⃗ and b⃗ agree on all constants in W. For example, let ξ be the template f(?, w<sub>1</sub>, ?, w<sub>2</sub>). The term f(a, w<sub>1</sub>, b, w<sub>2</sub>) matches ξ, but f(w<sub>0</sub>, w<sub>1</sub>, b, w<sub>2</sub>) does not, because one of the holes is filled with w<sub>0</sub> ∈ W. We say that a literal v ≈ f(b⃗) matches a template ξ if f(b⃗) matches ξ. The W-context of a W-template ξ in a set of literals L, denoted Z<sub>L</sub>(ξ), is the set Z<sub>L</sub>(ξ) ≜ {ℓ[W ↦ ?] | ℓ ∈ L ∧ ℓ matches ξ}, where ℓ[W ↦ ?] means that all occurrences of constants in W are replaced with a hole. For example, let ξ = f(?, w<sub>1</sub>, w<sub>2</sub>, ?) and L = {v ≈ f(a, w<sub>1</sub>, w<sub>2</sub>, b), u ≈ f(c, w<sub>1</sub>, w<sub>2</sub>, a), w ≈ f(x, w<sub>1</sub>, w<sub>2</sub>, b), x ≈ g(x, w<sub>1</sub>, w<sub>2</sub>, b)}; then Z<sub>L</sub>(ξ) = {v ≈ f(a, ?, ?, b), u ≈ f(c, ?, ?, a), w ≈ f(x, ?, ?, b)}.
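The matching and W-context definitions above can be sketched in a few lines of Python. This is only an illustration under assumed encodings: a term is a pair `(function_name, args)`, a literal `v ≈ f(...)` is a pair `(v, term)`, and the names `matches`, `w_context`, and `HOLE` are hypothetical, not taken from the paper.

```python
HOLE = "?"


def matches(term, template, W):
    """Return True if `term` matches `template`.

    Both are (function_name, args) pairs. They must agree on every
    position where either side holds a constant from W; in particular,
    a hole in the template may not be filled by a W constant.
    """
    f, b = term
    g, a = template
    if f != g or len(a) != len(b):
        return False
    for a_i, b_i in zip(a, b):
        if (a_i in W or b_i in W) and a_i != b_i:
            return False
    return True


def w_context(literals, template, W):
    """Compute Z_L(template): matching literals with W constants replaced by holes.

    The left-hand side of each literal is assumed to be a constant outside W,
    so only the arguments are rewritten.
    """
    context = set()
    for lhs, (f, args) in literals:
        if matches((f, args), template, W):
            holed = tuple(HOLE if c in W else c for c in args)
            context.add((lhs, (f, holed)))
    return context


# The two examples from the text.
W = {"w0", "w1", "w2"}
xi1 = ("f", (HOLE, "w1", HOLE, "w2"))
xi2 = ("f", (HOLE, "w1", "w2", HOLE))
L = [
    ("v", ("f", ("a", "w1", "w2", "b"))),
    ("u", ("f", ("c", "w1", "w2", "a"))),
    ("w", ("f", ("x", "w1", "w2", "b"))),  # the left-hand side "w" is not in W
    ("x", ("g", ("x", "w1", "w2", "b"))),
]
```

Here `w_context(L, xi2, W)` yields the three holed f-literals of the example and leaves out the g-literal.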

Since V and F are finite, the number of W-contexts is finite, independent of W. Let w<sub>Z</sub> be a fresh constant for context Z.

Definition 12 (Normalization Function) The normalization function n<sub>V</sub>(β) is defined as follows:


The normalization preserves V-equivalence of β because it renames local constants, while maintaining all consequences that are derivable through them. That is, n<sub>V</sub>(β) ≡<sub>V</sub> β. Furthermore, n<sub>V</sub>(β) is canonical.

Therefore, given a set of literals Γ, we use n<sub>V</sub>(β) as a computable implementation of the V-base abstraction, α<sub>V</sub> (Def. 9). That is, α<sub>V</sub>(Γ) ≜ n<sub>V</sub>(β) where ⟨W, β, δ⟩ ∈ base(Γ, V). Even though n<sub>V</sub>(β) may not be a part of a V-basis for Γ, it satisfies all the properties used in the proof of Thm. 4.

We define the normalizing abstraction in the usual way:

Definition 13 (Normalizing abstraction) The normalizing abstraction function α<sub>n</sub> : C → C is defined by

$$
\alpha_n(\langle s, q_0, pc \rangle) \stackrel{\Delta}{=} \langle s, q_0, n(pc) \rangle \qquad \qquad \square
$$

Let α<sub>b,r,n</sub> ≜ α<sub>b</sub> ∘ α<sub>r</sub> ∘ α<sub>n</sub> be the composition of normalization abstraction with renaming and base abstraction, where α<sub>b</sub> is implemented using normalization. Notice that, for any state c = ⟨s, q, pc⟩, α<sub>b,r,n</sub>(c) is computed by first computing *any* V-basis of pc, applying n<sub>q</sub>, renaming all C(q) constants to q<sub>0</sub>, and applying n<sub>q<sub>0</sub></sub>. The second normalization is required to ensure that the fresh constants are canonical with respect to q<sub>0</sub>. By definition, α<sub>b,r,n</sub> is computable. Hence, it can be used to compute the finite abstraction of any CUP.

Theorem 6 *For a CUP* P*, the finite abstract transition system* α<sub>b′,r,n</sub>(S<sub>P</sub>) *is bisimilar to* P *and is computable.* ✷

Thm. 6 implies that any property that is decidable over a finite transition system is also decidable over CUPs. In particular, temporal logic model checking is decidable.

#### VII. CONCLUSION

In this paper, we study theoretical properties of Coherent Uninterpreted Programs (CUPs) that have been recently proposed by Mathur et al. [9]. We identify a bug in the original paper, and provide an alternative proof of decidability of the reachability problem for CUPs. More significantly, we provide a logical characterization of CUPs. First, we show that the inductive invariant of a CUP is describable by shallow formulas. Hence, the set of all candidate invariants can be effectively enumerated. Second, we show that CUPs are bisimilar to finite transition systems. Thus, while they are formally infinite state, they are not any more expressive than a finite state system. Third, we propose an algorithm to compute a finite transition system of a CUP. This lifts all existing results on finite state model checking to CUPs.

In the paper, we have focused on the core result of Mathur et al., and have left out several interesting extensions. In [9], the notion of CUP is extended with k-coherence – a UP P is k-coherent if it is possible to transform P into a CUP Pˆ by adding k *ghost* variables to P. This is an interesting extension since it makes potentially many more programs amenable to decidable verification. We observe that the addition of *ghost* variables is a form of abstraction. Thus, invariants of Pˆ can be translated to invariants of P using techniques of Namjoshi et al. [13], [14]. This essentially amounts to existentially eliminating ghost variables from the invariant of Pˆ. Such elimination increases the depth of terms in the invariant by at most one for each variable eliminated. Thus, we conjecture that k-coherent programs are characterized by invariants with terms of depth at most k.

Mathur et al. [9] extend their results to recursive UP programs (i.e., UP programs with recursive procedures). We believe our logical characterization results extend to this setting as well. In this case, both the invariants and procedure summaries (i.e., procedure pre- and post-conditions) are described using terms of depth at most 1.

Our results also hold when CUPs are extended with simple axiom schemes, as in [10], while for most non-trivial axiom schemes CUPs become undecidable.

Perhaps most interestingly, our results suggest efficient verification algorithms for CUPs and interesting abstractions for UPs. Since the space of invariant candidates is finite, it can be enumerated, for example, using implicit predicate abstraction. For CUPs, this is a complete verification method. For UPs, it is an abstraction. Most importantly, it does not require prior knowledge of whether a UP is a CUP!

*Acknowledgment:* The research leading to these results has received funding from the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement No [759102-SVIS]). This research was partially supported by the United States-Israel Binational Science Foundation (BSF) grant No. 2016260, and the Israeli Science Foundation (ISF) grant No. 1810/18. We also acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC).

#### REFERENCES


*International Conference, VMCAI 2003, New York, NY, USA, January 9-11, 2002, Proceedings*, ser. Lecture Notes in Computer Science, L. D. Zuck, P. C. Attie, A. Cortesi, and S. Mukhopadhyay, Eds., vol. 2575. Springer, 2003, pp. 174–188. [Online]. Available: https://doi.org/10.1007/3-540-36384-X_16


# Data-driven Optimization of Inductive Generalization

Nham Le University of Waterloo nv3le@uwaterloo.ca

Xujie Si McGill University CIFAR AI Chair, Mila xsi@cs.mcgill.ca

Arie Gurfinkel University of Waterloo arie.gurfinkel@uwaterloo.ca

*Abstract*—Inductive generalization (IG) is the key to the efficiency of modern Symbolic Model Checkers (SMCs). In this paper, we introduce a *data-driven* method for inductive generalization, whose performance can be automatically improved through historical runs over similar instances. Our method is inspired by recent advances for the part-of-speech (PoS) tagging problem in natural language processing (NLP). Specifically, we use a hierarchical recurrent neural network augmented with syntactic and semantic information to predict essential parts of a proof obligation that could be generalized, instead of checking each part one by one. We develop a prototype called ROPEY by incorporating our method into SPACER – a state-of-the-art SMC, and perform evaluations on the KIND2 simulation benchmarks. ROPEY is evaluated in two settings: *online learning* – for a given instance, we run SPACER for a number of iterations and collect its trace on which ROPEY is trained, and then use ROPEY to guide SPACER to finish the remaining solving process; and *transfer learning* – ROPEY is trained over historical runs of SPACER in advance, and for future instances, ROPEY is used directly to guide SPACER from the very beginning. For non-trivial benchmarks, ROPEY perfectly answers 72% and 77% of the queries in the online and transfer learning settings, respectively. While the speed improvement is not the focus of the paper, our preliminary results are promising: for non-trivial instances, ROPEY's end-to-end running time is 25% faster.

### I. INTRODUCTION

Model checking has been widely used in various important areas such as robustness analysis of deep neural networks [27], verification of hardware designs [16], software verification [3], analysis [20] and testing [41], parameter synthesis in biology [5], and many others. The central challenge of model checking is to find a concise and sound approximation of all possible states a given system may reach, which does not cover any undesired states (i.e., states violating given specifications). Tremendous progress has been made by innovations in efficient data representations [10], scalable SAT solvers [43], [35], [18], and effective heuristics [14], [13], [32]. Modern model checkers share a common basis, namely, IC3 [7], of which the key insight is *inductive generalization (IG)*. This idea has been generalized to support rich theories [26] that are crucial for many verification tasks [30], [22] beyond hardware verification. The generalized IC3 with rich theories, also known as satisfiability checking for Constrained Horn Clauses modulo Theory (CHC) [6], becomes the core part of a broad range of verification tasks.

This work was supported, in part, by an Individual Discovery Grant from the Natural Sciences and Engineering Research Council of Canada, and the Canada CIFAR AI Chair Program.

Existing IG techniques follow either an enumerative search process [7], [8] or ad-hoc heuristics [21], [31]. These heuristics are effective but demand non-trivial domain-specific (or even problem-specific) expertise. In this work, we aim to learn such heuristics automatically from the past successful IGs. We observe that verification problems as well as associated IGs are not isolated from each other. Taking software verification as an example, verifying different properties of the same program involves similar or same IGs; different versions of programs have a similar code base; and different software may use the same conventions, idioms, libraries and frameworks, resulting in similar structures.

Our approach is inspired by recent advances in deep learning, especially in NLP where non-trivial semantic correlations between words are learned automatically using Neural Networks (NNs) [33]. However, IG raises many new challenges for deep learning. First, the input and the output of IG are symbolic expressions, which are *highly structured* with *rich semantics*. Slight syntactic variations can lead to dramatic changes in semantics. Second, more importantly, given that neural networks hardly provide any reliable guarantees, how to design a data-driven system based on deep neural networks, which exhibits *learnability* from past experiences but still preserves *soundness*? All these challenges have to be properly addressed in building a data-driven reasoning framework. In this work, we share our design choices and empirical findings in building a data-driven inductive generalization engine ROPEY, which introduces a neural component into SMC. Specifically, we make the following contributions:


https://doi.org/10.34727/2021/isbn.978-3-85448-046-4 <sup>17</sup> This article is licensed under a Creative Commons Attribution 4.0 International License

Fig. 1: Literal co-occurrences in solving PRODUCER\_CONSUMMER\_luke\_2\_e7\_1068\_e8\_1019.

queries, and this predictive power directly translates to an improvement in end-to-end running time.

The utility of our current solution is modest since its applications are restricted to two use-cases: verification of *multiple* properties of a *single* system (transfer learning), and guiding verification of a hard property using its partial run (online learning). This, however, is already useful in the context of multi-property verification that is common both in hardware and software verification domain [12]. More importantly, we demonstrate that NN-based heuristics can be effective in IC3 style algorithms. We believe this will lead to many further improvements, including heuristics that will eventually transfer between systems.

The rest of the paper is structured as follows. Sec. II shows a motivating example. Sec. III gives an overview of our approach. Sec. IV describes two novel embedding methods for converting symbolic expressions into numerical vectors. Sec. V formalizes the learning problem and describes our neural network architecture. Sec. VI presents our empirical evaluation and ablation study. Finally, Sec. VII discusses closely related work, and Sec. VIII concludes the paper.

#### II. A MOTIVATING EXAMPLE

In this section, we motivate our approach by illustrating the solving process of a particular CHC problem – the variant e7\_1068\_e8\_1019 of the problem PRODUCER\_CONSUMMER\_luke\_2 from KIND2 [11] benchmarks. We identify a bottleneck in IG, observe a pattern in the solving process, and explain how this leads to our intuition. While we use a specific instance for illustration, the results generalize to others in our benchmarks. We assume familiarity with SMC [15] and inductive generalization of IC3 [7]. These are also summarized in Sec. III.

SPACER cannot solve this variant in less than 930s. SPACER proves that the instance is safe up to depth 29 in 883s, in which 545s (61%) is spent on IG – so this is the bottleneck.

During the inductive generalization process, SPACER takes a candidate lemma L, and uses an SMT solver to check whether each literal of L can be dropped. Each call to the SMT solver is potentially very costly. Thus, it is desirable to drop or skip multiple literals together.

We conjecture that there is a pattern between literals: some groups of literals may always be dropped or kept together. If this correlation is known, it can be used to speed up IG.

Fig. 2: Overview of Symbolic Model Checking and ROPEY.

To verify our hypothesis, in Fig. 1 we visualize the co-occurrences of kept literals in the instance. Literals are ordered by the time they are learned. Each cell X<sub>ij</sub> in the grid is the number of times the literals ℓ<sub>i</sub> and ℓ<sub>j</sub> appear together in some generalized lemma (normalized by the largest value). In the figure, brighter cells indicate larger values.

The figure shows a strong geometric pattern, with literals clustered into unusual groups. However, we are not able to tell the exact heuristics describing those patterns. In this paper, we turn this observation into a practical inductive generalization method with the help of a data-driven approach.
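For concreteness, a grid like the one in Fig. 1 could be computed as sketched below. This is a minimal illustration, not the authors' plotting code: it assumes each generalized lemma is given as a set of integer literal ids (ordered by learning time), counts a literal as co-occurring with itself, and uses the hypothetical name `cooccurrence`.

```python
from itertools import product


def cooccurrence(lemmas, num_literals):
    """Return the matrix X where X[i][j] counts lemmas containing both
    literal i and literal j, normalized by the largest entry."""
    X = [[0.0] * num_literals for _ in range(num_literals)]
    for lemma in lemmas:
        # Every ordered pair (i, j) of literals in the same lemma co-occurs once.
        for i, j in product(lemma, repeat=2):
            X[i][j] += 1.0
    peak = max(max(row) for row in X)
    if peak > 0:
        X = [[value / peak for value in row] for row in X]
    return X


# Toy input (hypothetical): literals 0 and 1 are always kept together.
toy_lemmas = [{0, 1}, {0, 1}, {2}]
toy_grid = cooccurrence(toy_lemmas, 3)
```

In the toy grid, the block formed by literals 0 and 1 is bright (value 1.0), while literal 2 only co-occurs with itself.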

#### III. OVERVIEW

In this section, we give an overview of our technique, outline the challenges involved, and present our key insights to address them. The context is symbolic SMT-based Model Checking (SMC) [7], [26], [29], also known as satisfiability checking for Constrained Horn Clauses modulo Theory (CHC) [6]. In Model Checking, the high-level goal is to show that an infinite state transition system (Tr) does not have an execution/path that reaches a set of bad states (Bad) by finding a formula Inv that is an inductive invariant of Tr and does not intersect with Bad. The goal of CHC solving is to show that a set of First Order Logic formulas Φ that satisfy the Horn restriction [6] is satisfiable by exhibiting a symbolic formula Model that defines an FOL model that satisfies Φ. The two problems are closely related. Model Checking is often reduced to CHC solving. Both problems are in general undecidable.

Fig. 2a shows the basic structure of an SMC algorithm based on the IC3 architecture. In the paper, we use the SMC SPACER [29], but the architecture is common to many engines. SMC iteratively unrolls Tr, uses an SMT solver to find a bounded counterexample (which is usually decidable), and, if no counterexample is found, attempts to create an inductive invariant. The invariant is constructed as a set of so-called *lemmas*, where each lemma blocks a predecessor of Bad (a *proof obligation*), and is a disjunction of atomic formulas. An example lemma is x ≤ 0 ∨ y, which is often written as a set for convenience, i.e., {x ≤ 0, y}. Many of the details of the algorithm are not important, and we omit them here. The step we focus on in this paper is *inductive generalization* (IG) (highlighted in blue in Fig. 2a), which is responsible for generalizing learned lemmas. In practice, IG is crucial for the performance of SMC.

Input: the original F-inductive lemma L = {ℓ<sub>1</sub>, ℓ<sub>2</sub>, ..., ℓ<sub>n</sub>}

Output: a generalized F-inductive lemma K ⊆ L

    K ← ∅    // kept literals
    C ← L    // literals to check
    while C ≠ ∅ do
        K, C ← dropOne(K, C)
    return K

    function dropOne(K, C)
        lit ← pick(C)
        if isInductive(K ∪ C \ {lit}) then
            C ← C \ {lit}
        else
            K ← K ∪ {lit}
            C ← C \ {lit}
        return K, C

Fig. 3: ITERDROP algorithm.

Conceptually, inductive generalization is a simple process, usually done with an algorithm similar to the one we call ITERDROP<sup>1</sup>, shown in Fig. 3. ITERDROP starts with a valid lemma L = {ℓ<sub>1</sub>, . . . , ℓ<sub>n</sub>}, and proceeds to generalize L by removing an arbitrarily chosen literal from L, and using an SMT solver to check whether the lemma is still valid (by calling isInductive). The details of isInductive are not important – but it can be quite expensive. If the call succeeds, the literal is removed, otherwise it is kept. The goal is to generalize to a valid lemma with a minimal number of literals. From now on, when the context is clear, we use *generalization* instead of inductive generalization.
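The loop of Fig. 3 can be sketched in Python as follows. This is an illustration only: the inductiveness check is passed in as a callback (in SPACER it is an SMT query), literals are plain strings, and the toy oracle at the bottom is a hypothetical stand-in that declares a lemma inductive exactly when it still contains two particular literals.

```python
def iter_drop(lemma, is_inductive):
    """Generalize `lemma` (a list of literals) by trying to drop one literal at a time.

    `is_inductive(literals)` is an oracle deciding whether the given list of
    literals is still an inductive lemma. Returns the kept literals.
    """
    kept = []                 # K: literals that could not be dropped
    to_check = list(lemma)    # C: literals still to be examined
    while to_check:
        lit = to_check.pop(0)  # pick(C): here, simply the first element
        # Test the lemma K ∪ C \ {lit}; if it is not inductive, lit must stay.
        if not is_inductive(kept + to_check):
            kept.append(lit)
    return kept


# Hypothetical oracle for demonstration: the lemma is inductive iff it
# contains both "x1" and "x9 - x10 >= 41". It also records every call.
oracle_calls = []


def toy_oracle(literals):
    oracle_calls.append(list(literals))
    return {"x1", "x9 - x10 >= 41"} <= set(literals)


generalized = iter_drop(
    ["x3", "x1", "x6 = 1", "x9 - x10 >= 41", "x5 = 1"], toy_oracle
)
```

With this oracle, ITERDROP issues one check per literal, five in total, which is the cost the rest of the section aims to reduce.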

We illustrate ITERDROP with a sample run, shown in Fig. 4a. Starting from the given lemma L = {x<sub>3</sub>, x<sub>1</sub>, x<sub>6</sub> = 1, x<sub>9</sub> − x<sub>10</sub> ≥ 41, x<sub>5</sub> = 1}, ITERDROP proceeds as follows:


The example highlights the difficulty of inductive generalization. First, each call to isInductive is potentially very expensive. Thus, reducing the number of the calls is highly desirable. Second, many of the calls, like steps 3 and 5 are "useless" – no new lemma is learned from them. Thus, reducing such "useless" calls is also highly desirable. Finally, a solver makes many (up to thousands) such inductive generalization calls per run.

Our *key insight* is that since generalization happens frequently, and, while the lemmas are different, the literals are similar, *it is possible to learn the co-occurrence between literals that do and do not occur in the same lemma together*. This co-occurrence, if learned, could then be used to improve inductive generalization!

<sup>1</sup>While there are more advanced IG techniques, such as [23], we choose ITERDROP since it is used in SPACER – a state-of-the-art CHC solver.

Crucially, SPACER learns new literals all the time, and literals between different instances of the same problem are often similar, for instance, x<sub>1</sub> − 2x<sub>3</sub> ≥ 20 and x<sub>1</sub> − 2x<sub>3</sub> ≥ 25. Thus, an ML-based solution is useful to transfer knowledge between different sets of literals. Our method is inspired by the PoS-tagging problem in NLP, in which NNs automatically learn co-occurrence patterns between words and their tags. We elaborate more on this inspiration in Sec. V. We have also tried creating our own hand-crafted heuristics for directly calculating co-occurrence (for example, by using Boolean abstraction of literals), but none worked well in practice.

Concretely, we propose a novel neural network architecture, denoted by ℳ, that learns from past IG queries, and is then used to predict answers for new IG queries. As shown in Fig. 4c, ℳ outputs a binary mask (a list of zeros and ones) corresponding to literals that should be dropped or kept in the lemma. To evaluate ℳ in the context of an SMC, we devise a new neural-based IG algorithm called XDROP, that has ℳ at its core (Fig. 6). We have developed ROPEY, a prototype SMC that uses XDROP to guide SPACER (Fig. 2b).

In Fig. 4b, we illustrate a run of XDROP on our example: (1) it runs ℳ on the input L; (2) it creates a mask {0, 1, 0, 1, 0}, corresponding to a candidate L<sub>cand</sub> = {x<sub>1</sub>, x<sub>9</sub> − x<sub>10</sub> ≥ 41}; (3) it checks the inductiveness of L<sub>cand</sub>; (4) it accepts L<sub>cand</sub>, and runs ITERDROP starting from L<sub>cand</sub>. Note that XDROP runs only 3 inductiveness checks, compared to 5 used by ITERDROP.
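The four steps of this run can be sketched as below. The sketch covers only what the example describes; the actual XDROP algorithm is given in Fig. 6. In particular, falling back to plain ITERDROP on the full lemma when the predicted candidate is not inductive is an assumption made here to keep the result sound, and `predict_mask`, `x_drop`, and the toy oracle are hypothetical names.

```python
def iter_drop(lemma, is_inductive):
    """Plain ITERDROP: try to drop each literal in turn (see Fig. 3)."""
    kept, to_check = [], list(lemma)
    while to_check:
        lit = to_check.pop(0)
        if not is_inductive(kept + to_check):
            kept.append(lit)
    return kept


def x_drop(lemma, predict_mask, is_inductive):
    """Mask-guided generalization.

    `predict_mask(lemma)` stands in for the neural model: it returns one
    0/1 entry per literal, where 1 means "keep".
    """
    mask = predict_mask(lemma)
    candidate = [lit for lit, keep in zip(lemma, mask) if keep]
    if is_inductive(candidate):
        # Prediction accepted: continue generalizing from the smaller lemma.
        return iter_drop(candidate, is_inductive)
    # Assumed fallback: a bad prediction costs one check but not soundness.
    return iter_drop(lemma, is_inductive)


# Hypothetical oracle and model reproducing the numbers of the example.
oracle_calls = []


def toy_oracle(literals):
    oracle_calls.append(list(literals))
    return {"x1", "x9 - x10 >= 41"} <= set(literals)


lemma = ["x3", "x1", "x6 = 1", "x9 - x10 >= 41", "x5 = 1"]
result = x_drop(lemma, lambda lits: [0, 1, 0, 1, 0], toy_oracle)
```

With a correct mask, the run uses one check for the candidate and two more inside ITERDROP, matching the three checks reported above.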

Challenges. To make ROPEY a practical verification engine, we have to address challenges in both the machine learning and the logical soundness aspect. For machine learning, the challenge is in representing symbolic expressions as vectors, while still maintaining their rich semantic structure. For logical soundness, the challenge is in setting up the learning objective and using the neural net in a way that guarantees the soundness of a verification engine.

Representation learning of symbolic formulas. Literals are symbolic formulas, which are structured and have meaning sensitive to small changes. Simply viewing a literal as a sequence of tokens fails to capture the subtle semantic differences between structurally similar formulas.

We incorporate both syntactic and semantic information of a literal into its representation. Our approach views a literal as a directed acyclic graph (DAG), which is post-processed from its abstract syntax tree (AST), and then adapts TREELSTM [44] to embed such a DAG structure. Our approach also takes semantic information into consideration so that specific properties of values are respected: embeddings of numbers and variables should preserve their relative order and equality.

Fig. 4: Examples of how ITERDROP and XDROP do inductive generalization on the same query.

Learning for inductive generalization. Directly using ML to address the generalization problem is a non-trivial structure prediction problem. It takes in a set of symbolic formulas and outputs another set of symbolic formulas that are more general and more concise. Rather than having an end-to-end ML solution, we embed a learning component in a classic symbolic approach of generalization. Specifically, the learning component captures the co-occurrence between literals appearing in past runs and predicts the likelihood of keeping or dropping a literal in the current run. Furthermore, uncertainties introduced by the learning component have to be carefully controlled, since they could otherwise lead to unsound conclusions. ROPEY is designed to make sound progress no matter what predictions the learning component provides. Bad predictions may be harmful to the performance, but not to soundness!

#### IV. REPRESENTATION LEARNING

Machine learning frameworks [36] and algorithms [44], [38] operate over fixed-length numerical vectors. One challenge for applying machine learning for IG is converting discrete structures with rich semantic meanings into such numerical representations. In this section, we describe how we embed the basic unit of our inputs – symbolic formulas – into fixed-length vectors, while still maintaining their syntactic and semantic meaning to a certain extent.

#### *A. Representing and normalizing symbolic formulas*

Abstract Syntax Trees (ASTs) are natural representations of formulas that are traditionally used in parsing and compilers. They preserve the key structure of the formula, while hiding (or abstracting) unnecessary details such as white space, commas and parentheses. Alternative representations such as sequences of tokens abstract too much of the structure of the formula, while highlighting unnecessary differences. Thus, we represent logical formulas using their ASTs: operators label nodes of the tree, operands are children, constants (boolean and numeric) and variables are leaves. An example of an AST is shown in Fig. 5b.

Ideally, we would like to represent semantically equivalent formulas with the same AST. However, this is not guaranteed if one naively parses a formula into an AST. For example, x + 0 > y and x > y are semantically equivalent, yet differ in the concrete syntax, *and* have different ASTs. To address this, we rewrite each formula in a "normal" form by simplifying as well as ordering commutative operators. Specifically, we use a simplification engine of Z3 [17]. Our normalizer cannot handle sophisticated semantic equivalences, such as normalizing 2/7 · x<sub>9</sub> − 4/7 · x<sub>10</sub> ≥ 6 into 1/7 · x<sub>9</sub> − 2/7 · x<sub>10</sub> ≥ 3. Improving the normalization process to handle such cases would be interesting future work.

Note that semantically equivalent rewriting and normalization make our representations of symbolic formulas essentially *directed acyclic graphs (DAGs) modulo semantic equivalence*, because semantically equivalent subtrees share the exact same embedding. Indeed, representations of symbolic formulas in our implementation are DAGs, although they are viewed as if they were trees by the embedding model. Without further notice, when we refer to a node in a tree, we actually mean its corresponding node in the DAG.

We use TREELSTM [44] to embed a symbolic formula, or more concretely its AST representation, into a fixed-length vector. TREELSTM is essentially a recursive process, where the embedding of a (sub-)tree is an aggregation of the embedding of the root node and embeddings of its subtrees. The basic requirement of using TREELSTM is to have an embedding for each node. In the rest of this section, we describe the features used to embed each AST node into a fixed-length vector.

#### *B. Embedding features of an AST node*

A common technique to map a node N to a vector is to first map the infinite (or simply large) set Σ of all possible nodes into a finite set T of tokens (a.k.a. *encoding*), and then *embed* each token into a vector using an embedding matrix of size |T| × d<sub>emb</sub>.

*a) Encoding:* Under the standard encoding scheme, many nodes have to be mapped into the same token. For example, in NLP, all out-of-vocabulary words are mapped into a token <UNK>. Similarly, variable names, and numerical constants over an expression can be mapped into two tokens: <VAR> and <NUM>, respectively.

Fig. 5: (a) The grammar for AST node features, and (b) an example AST and its semantic features.

Unfortunately, this encoding scheme is inadequate in our setting. We believe that both the variable names and the values of the numeric constants are highly relevant for successful generalizations! For example, consider two pairs of formulas:

$$x_1 - 2x_3 + 7x_5 \ge 10 \qquad x_1 - 2x_3 + 7x_5 \ge 14 \qquad \text{(1)}$$

$$x_1 - 2x_3 + 7x_5 \ge 10 \qquad x_1 + x_3 - x_5 \ge 0 \qquad \text{(2)}$$

Pair (1) represents two parallel hyperplanes, with the first subsuming the second. Pair (2) represents two intersecting hyperplanes and cannot be simplified any further. The difference between the two pairs disappears when all numeric constants are mapped to a small finite set of tokens. Yet, this difference is crucial for successful learning in our context!

Instead of abstracting variables (or constants) into a single token, we propose a finer granularity abstraction as follows. Each node is abstracted as a pair ⟨Kind, Value⟩, whose grammar is shown in Fig. 5a. Kind captures the type (or sort) of the expression of an AST node. The encoding is one of the pre-defined symbols, such as ⟨BOOL OP⟩ for a Boolean operator, etc. Value captures the content of an AST node. It could be a *Variable Name*, an *Operator*, or a *Constant*. Operators and Constants are encoded as their string representations. Variable Names are encoded using the form x\_i, where x is some fixed string, and i is a numeric id of the variable.

Next, we describe how we embed the pair ⟨Kind, Value⟩ into a fixed-length vector.

*b) Embedding:* Kind is embedded into a fixed-length vector of length d<sub>Kind</sub> using a standard embedding matrix [34] E<sub>Kind</sub> of size |Kind| × d<sub>Kind</sub>. Value could be embedded in the same manner. However, given that Value is quite diverse, we propose different embedding methods for different kinds of values. When Value is an Op, we introduce a second embedding matrix E<sub>Op</sub> of size |Op| × d<sub>Op</sub>.

When Value is a Variable Name, we combine two embedding methods. The first method, which we call *Naive Embedding*, is the same as above, in which we use another embedding matrix E<sub>Var</sub> of size |Var| × d<sub>Var</sub>. The second method, which we call *Positional Embedding*, is based on the method introduced in [46]. It embeds the id t of the normalized variable name x\_t as follows: the embedding of the position t is a vector PE<sup>d</sup>(t) of length d, and the value of the i-th entry in the vector PE<sup>d</sup>(t) is defined as follows:

$$\text{PE}^d(t)_i = \begin{cases} \sin(\omega_k \cdot t) & \text{if } i = 2k\\ \cos(\omega_k \cdot t) & \text{if } i = 2k+1 \end{cases}$$

where ω<sub>k</sub> = 10000<sup>−2k/d</sup>. This embedding satisfies many nice properties: each position is mapped to a unique value, all entries in the vector are between −1 and 1 (which makes learning easier), and, lastly, for every fixed offset k, there exists a transformation matrix T ∈ ℝ<sup>d×d</sup> s.t. (T · PE<sup>d</sup>(t))<sub>i</sub> = PE<sup>d</sup>(t+k)<sub>i</sub> holds for any position t and index i [46]. This last property allows the model to learn relative positions easily. In practice, we combine the two methods by concatenating their vectors.
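The definition of PE<sup>d</sup>(t) and the offset property can be checked numerically with the short sketch below. It assumes 0-based indices i and an even dimension d; the function names are hypothetical, and `shift` applies the block-diagonal rotation that plays the role of the matrix T for a fixed offset, without building T explicitly.

```python
import math


def positional_embedding(t, d):
    """PE^d(t): entry i is sin(w_k * t) if i = 2k and cos(w_k * t) if i = 2k + 1."""
    vec = []
    for i in range(d):
        k = i // 2
        omega = 10000 ** (-2 * k / d)
        vec.append(math.sin(omega * t) if i % 2 == 0 else math.cos(omega * t))
    return vec


def shift(vec, offset):
    """Map PE^d(t) to PE^d(t + offset) with a linear map that depends only on offset.

    Each (sin, cos) pair is rotated by the angle w_k * offset, using the
    angle-addition identities.
    """
    d = len(vec)
    out = []
    for k in range(d // 2):
        omega = 10000 ** (-2 * k / d)
        s, c = vec[2 * k], vec[2 * k + 1]
        cos_o, sin_o = math.cos(omega * offset), math.sin(omega * offset)
        out.append(s * cos_o + c * sin_o)  # sin(w_k * (t + offset))
        out.append(c * cos_o - s * sin_o)  # cos(w_k * (t + offset))
    return out
```

For example, `shift(positional_embedding(3, 8), 5)` agrees with `positional_embedding(8, 8)` up to floating-point error, and every entry lies in [−1, 1].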

When Value is a Constant, we want to embed it in a way that allows the network to quickly extract magnitudes of constants along with their values. We propose the following *Constant Embedding* method: Given a numerical value p, its embedding is a vector CE<sup>n</sup>(p) of length 2(n + 1). To embed it, we first write p in its scientific notation: p = s × 10<sup>e</sup>. The entries in CE<sup>n</sup>(p) are then calculated as follows:

$$\begin{aligned} \mathbf{CE}^n(p)_1 &= s \\ \mathbf{CE}^n(p)_{i \neq 1} &= \begin{cases} 1 & \text{if } i = 2 + n + e \\ 0 & \text{if } i \neq 2 + n + e \end{cases} \end{aligned}$$

Simply put, we embed the significand s as the first entry in the vector, and the rest is the one-hot encoding of e in the range [−n, n]. For example, with n = 2, p = 42 = 4.2 × 10<sup>1</sup>, its embedding is CE<sup>2</sup>(42) = [4.2 0 0 0 1 0]. Similarly, CE<sup>3</sup>(0.42) = [4.2 0 0 1 0 0 0 0].
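The two examples can be reproduced with the following sketch. It is an illustration under assumptions the text does not spell out: the function name is hypothetical, the scientific notation is obtained through Python's `decimal` module, zero is mapped to s = 0 and e = 0, and exponents outside [−n, n] raise an error.

```python
from decimal import Decimal


def constant_embedding(p, n):
    """CE^n(p): [significand] followed by a one-hot encoding of the exponent in [-n, n]."""
    d = Decimal(str(p))
    # adjusted() is the exponent e of the scientific notation p = s * 10^e.
    e = d.adjusted() if d != 0 else 0
    if not -n <= e <= n:
        raise ValueError(f"exponent {e} is outside [-{n}, {n}]")
    s = float(d.scaleb(-e))  # significand, with one digit before the decimal point
    one_hot = [0] * (2 * n + 1)
    one_hot[n + e] = 1       # 1-based position 2 + n + e in the full vector
    return [s] + one_hot
```

Calling `constant_embedding(42, 2)` gives `[4.2, 0, 0, 0, 1, 0]`, and `constant_embedding(0.42, 3)` gives `[4.2, 0, 0, 1, 0, 0, 0, 0]`, as in the text.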

The final feature vector for a node is then the concatenation of the embeddings of Kind and Value. In our experiments, we set d<sub>Kind</sub> = d<sub>Op</sub> = d<sub>Var</sub> = d = 64, and n = 6. We conclude this section with an example. Fig. 5b shows an AST for x<sub>9</sub> − x<sub>10</sub> ≥ 41 and its transformation into a tree of feature vectors, with n = 6 and d = 64.

#### V. LEARNING TO GENERALIZE

In this section, we elaborate on our insight first mentioned in Sec. III, then we describe the details of our model.


TABLE I: Two examples for PoS-tagging (left) and IG (right).

#### *A. Lemma Labeling Problem*

In Natural Language Processing, part-of-speech tagging (PoS-tagging) is the process of labeling each word in a text (corpus) with a particular part of speech, based on *its definition and its context*. Table I (left) shows an example of tagging a sentence. To correctly tag each word, a tagger needs to know that "park" in this context is a verb, not a noun. State-of-the-art PoS-taggers tackle this problem purely from a probabilistic view [45]: in the dataset, how many times "park" is tagged as a NOUN, how many times "park" is tagged as a VERB given that the following word is tagged as an ADVERB, etc.

Our insight is that inductive generalization can be viewed as a special case of PoS-tagging in which there are only two tags: drop and keep. Table I (right) shows one such example. We also view the problem in the same probabilistic way: in the dataset, how many times x<sub>3</sub> is kept, how many times x<sub>3</sub> is dropped given that x<sub>1</sub> is kept, etc. It is reasonable to expect that there are shared patterns between different properties of the same system, or between different points in time of the same solving process. However, the learned patterns are not expected to transfer between different systems (x<sub>3</sub> in one system is completely different from x<sub>3</sub> in the others, just like "park" in English and Korean are completely different).
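The probabilistic view can be made concrete with a small counting sketch in Python. This is not the neural model used in this work; it only illustrates the statistics mentioned above. The dataset format (pairs of a literal list and a 0/1 label list, where 1 means keep) and all names are assumptions of the sketch.

```python
from collections import Counter


def keep_statistics(dataset):
    """Count how often each literal is seen and kept, alone and given another kept literal.

    dataset: iterable of (literals, labels) pairs, where labels[i] == 1 means
    that literals[i] is kept and 0 means that it is dropped.
    """
    seen, kept = Counter(), Counter()
    seen_given, kept_given = Counter(), Counter()
    for literals, labels in dataset:
        for lit, lab in zip(literals, labels):
            seen[lit] += 1
            kept[lit] += lab
            for other, other_lab in zip(literals, labels):
                if other != lit and other_lab == 1:
                    # Statistics for `lit`, conditioned on `other` being kept.
                    seen_given[(lit, other)] += 1
                    kept_given[(lit, other)] += lab
    return seen, kept, seen_given, kept_given


# Hypothetical traces: literal names and labels are made up for illustration.
traces = [(["x1", "x3"], [1, 0]), (["x1", "x3"], [1, 1]), (["x3", "x5"], [1, 0])]
seen, kept, seen_given, kept_given = keep_statistics(traces)
dropped_x3_given_x1 = seen_given[("x3", "x1")] - kept_given[("x3", "x1")]
```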

Formally, we define our problem as an instance of the *sequence labeling problems*:

**Problem 1 (Lemma labeling problem)** *Let* $\mathcal{L}$ *be the set of all possible literals. Given a list of literals* L *of length* n *and a vector* M *of zeros and ones,* |M| = n*, train a tagger* $\mathcal{M} : \mathcal{L}^n \mapsto \{0, 1\}^n$ *s.t.* $\mathcal{M}(L) \approx M$*.*

Note that in the problem definition we keep the lemma as a list instead of a set of literals. This means that given a different ordering of the same set of literals, we might end up with a different result. However, this is also the behavior of SPACER, because SPACER maintains the lemma as a list of literals, and pick(C) in Fig. 3 simply returns the first element in C.

#### *B. Model*

To handle inputs of different lengths, we use two variants of the Long Short-Term Memory (LSTM) [25] network. At a high level, the information (hidden state) at each timestep t in a vanilla LSTM is $\overrightarrow{h\_t} = \mathit{LSTM}(i\_t, \overrightarrow{h\_{t-1}})$, where i<sub>t</sub> is the input at timestep t, and a vector of zeros is used for the initial $\overrightarrow{h\_0}$. Intuitively, the formula says that the hidden state at timestep t captures information from every *prior* timestep.

The first variant, Bidirectional-LSTM [38], has been shown to improve the labeling performance in NLP tasks [47]. It extends LSTM by including information from *later* timesteps as well, thus allowing the network to use better context information. Concretely, it adds the backward state $\overleftarrow{h\_t} = \mathit{LSTM}(i\_t, \overleftarrow{h\_{t+1}})$. Then, the hidden state h<sub>t</sub> is the concatenation $[\overleftarrow{h\_t}, \overrightarrow{h\_t}]$.

Input: the original F-inductive lemma L = {ℓ<sub>1</sub>, ℓ<sub>2</sub>, ..., ℓ<sub>n</sub>}

Output: a generalized F-inductive lemma

1. L<sub>Cand</sub> ← {ℓ<sub>i</sub> | ℓ<sub>i</sub> ∈ L, M(L)[i] = 1}
2. if isInductive(L<sub>Cand</sub>) then
3. &nbsp;&nbsp;&nbsp;&nbsp;return iterDrop(L<sub>Cand</sub>)
4. else
5. &nbsp;&nbsp;&nbsp;&nbsp;return iterDrop(L)

Fig. 6: XDROP algorithm.

The second variant, TREELSTM [44], has been shown to be suitable for tree-like inputs, such as ASTs. It extends LSTM by considering the linear chain of timesteps as a special case of a tree, in which each node has exactly one child. Given a node i<sub>j</sub> in a tree, where H(i<sub>j</sub>) is the set of hidden states of the child nodes of i<sub>j</sub>, TREELSTM extends the equations with h<sub>j</sub> = TreeLSTM(i<sub>j</sub>, H(i<sub>j</sub>)). Intuitively, TREELSTM passes information from all children to their parent, allowing better topology information to be learned. In this work, we use the information at the root node as the summary of the whole tree.<sup>2</sup>

Fig. 4c shows our full model with a Bidirectional LSTM layer on top of a TREELSTM layer in a hierarchical manner. From top to bottom in Fig. 4c, at a literal ℓ<sub>t</sub> corresponding to an AST with root Root<sub>t</sub>, we calculate the following:

$$\begin{array}{c} i\_t = \textit{TreeLSTM} \left( \textit{Root}\_t, H(\textit{Root}\_t) \right) \\ \overleftarrow{h\_t} = \textit{LSTM} \left( i\_t, \overleftarrow{h\_{t+1}} \right) \quad \overrightarrow{h\_t} = \textit{LSTM} \left( i\_t, \overrightarrow{h\_{t-1}} \right) \\ h\_t = [\overleftarrow{h\_t}, \overrightarrow{h\_t}] \\ y\_t = W^{\top} h\_t + b \end{array}$$

where W ∈ ℝ<sup>|h<sub>t</sub>|×2</sup> and b ∈ ℝ<sup>2</sup> are the weight matrix and bias that transform h<sub>t</sub> into a vector of size 2. Each equation above corresponds to a layer in Fig. 4c. Finally, the predicted label for ℓ<sub>t</sub> is the index of the max value of y<sub>t</sub>.
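The data flow of the bidirectional layer and the final label selection can be sketched in plain Python. The recurrent cell below is a toy stand-in for an LSTM cell, the per-literal inputs stand in for the TREELSTM summaries, and the linear form y = Wᵀh + b is an assumption consistent with the stated shapes of W and b. None of this is the PyTorch model used in this work.

```python
def bidirectional_states(inputs, cell, h0):
    """Run `cell` forward and backward over `inputs` and concatenate the states.

    cell(x, h) -> new state (a list of floats); h0 is the initial state used
    in both directions. Returns, for each timestep t, [backward_t, forward_t].
    """
    n = len(inputs)
    forward, backward = [None] * n, [None] * n
    h = h0
    for t in range(n):  # forward state depends on timesteps 0..t
        h = cell(inputs[t], h)
        forward[t] = h
    h = h0
    for t in reversed(range(n)):  # backward state depends on timesteps t..n-1
        h = cell(inputs[t], h)
        backward[t] = h
    return [backward[t] + forward[t] for t in range(n)]


def predict_labels(states, W, b):
    """Label each timestep with the index of the larger entry of y = W^T h + b."""
    labels = []
    for h in states:
        y = [sum(W[r][c] * h[r] for r in range(len(h))) + b[c] for c in range(2)]
        labels.append(0 if y[0] >= y[1] else 1)
    return labels


def toy_cell(x, h):
    """Toy recurrence (not an LSTM): decay the previous state and add the input."""
    return [0.5 * hi + xi for xi, hi in zip(x, h)]


states = bidirectional_states([[1.0], [2.0], [3.0]], toy_cell, [0.0])
labels = predict_labels(states, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Every concatenated state sees the whole sequence: the forward half summarizes earlier literals and the backward half summarizes later ones.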

Fig. 6 describes how we use the learned model in our neural-based IG algorithm XDROP. Given that deep learning models could make arbitrary predictions, special care needs to be taken in order to preserve soundness. In the worst case, XDROP should be effectively the same as ITERDROP. More formally, we have the following important yet straightforward theorem.

#### Theorem 1 XDROP *is sound and terminating.*
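The argument behind the theorem can be seen in a minimal Python sketch of Fig. 6. Here `is_inductive` stands in for the solver query and `predict` for the learned model; both are passed in as plain functions. The body of `iter_drop` is a hypothetical rendering of an iterative-drop loop, since ITERDROP itself is defined elsewhere in the paper, and the toy oracle is made up for illustration.

```python
def iter_drop(lemma, is_inductive):
    """Try to drop literals one at a time, keeping a drop only if the check passes.

    Hypothetical rendering of an iterative-drop loop: every accepted candidate
    has passed `is_inductive`, and each step either shrinks the lemma or
    advances the index, so the loop terminates.
    """
    kept = list(lemma)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if is_inductive(candidate):
            kept = candidate
        else:
            i += 1
    return kept


def xdrop(lemma, predict, is_inductive):
    """Sketch of Fig. 6: trust the model's prediction only after checking it."""
    labels = predict(lemma)
    candidate = [lit for lit, keep in zip(lemma, labels) if keep == 1]
    if is_inductive(candidate):
        return iter_drop(candidate, is_inductive)
    return iter_drop(lemma, is_inductive)  # fall back to the original lemma


# Toy oracle (made up): a lemma is "inductive" iff it contains both a and b.
oracle = lambda lemma: {"a", "b"} <= set(lemma)
print(xdrop(["a", "x", "b", "y"], lambda l: [1, 0, 1, 0], oracle))  # ['a', 'b']
print(xdrop(["a", "x", "b", "y"], lambda l: [0, 1, 0, 1], oracle))  # ['a', 'b']
```

A wrong prediction costs one extra inductiveness check but never changes what is accepted, since every returned lemma has passed the check or is the original input.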

XDROP is implemented in Python using PyTorch [36], while SPACER is implemented in C++. We implement a client-server architecture in which XDROP is wrapped in a gRPC server which connects to a gRPC client inside SPACER.

#### *C. Discussion*

Using NNs to guide generalization might seem arbitrary at first. Perhaps a simpler heuristic based on counting frequency is sufficient. In fact, we have tried many different handcrafted heuristics first. However, two common problems arose: (a) the heuristics do not work consistently across different benchmarks; (b) even if a heuristic works, it does not transfer to different properties since different literals are learned for different properties and systems.

<sup>2</sup> It is also possible to use the sum of every node in the tree as the summary, as mentioned in [44].

Fig. 7: M's predictive power for benchmarks with at least k IG queries.

There are many ways to guide generalization using a neural component other than the one we chose. Perhaps most desirable is to have an end-to-end solution in which the neural component takes an original lemma as input and produces a generalized lemma as output. However, the symbolic reasoning required for this is so complex that we believe that such a solution is much harder to train and scale up. Another alternative is to learn an approximation of the inductive check, i.e., the function isInductive(Context, L) ↦ {true, false} that determines whether a candidate lemma L is inductive in the current context. We have tried such an approach, but could not make it effective. The difficulty is that the Context that is used by the inductive checker is a large symbolic formula. This makes training the network difficult. We suspect it is as hard as *learning a neural SMT-solver* [40], [39].

#### VI. EMPIRICAL EVALUATION

#### *A. Benchmarks and environment setup*

We evaluate ROPEY on a set of publicly available<sup>3</sup> simulation benchmarks for the KIND2 model checker [11] (simply called KIND2 from now on). This benchmark suite corresponds to verification of systems that are known to be challenging for IG, for which SPACER behaves poorly. Furthermore, KIND2 benchmarks can be easily grouped into a training set (i.e., a set of original benchmarks) and a testing set (i.e., a set of corresponding variants). In total, KIND2 consists of 324 benchmarks.

We train ROPEY's neural network M using the Adam optimizer [28] with dropout rate 0.5. We set the hidden size of TreeLSTM to be 64, and use the embedding dimensions mentioned in Sec. IV.<sup>4</sup> We stop training when either the performance has not improved over the last 250 epochs or the number of epochs reaches a predefined threshold (i.e., 1 500). Naive Embedding, Positional Embedding and Constant Embedding are always used. An ablation study for those embeddings is discussed in Sec. VI-E. All experiments are performed on a Linux desktop equipped with an Intel® Xeon E5-2680 v2, an NVIDIA 1080 Ti GPU, and 64 GB of memory. The artifacts including code and data are available on the project website at https://nhamlv-55.github.io/Ropey.
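The stopping rule can be written down as a short Python sketch. The callback `run_epoch`, which is assumed to train for one epoch and return a validation score where higher is better, and the function name are hypothetical. Only the patience of 250 epochs and the cap of 1 500 epochs come from the text.

```python
def train_with_early_stopping(run_epoch, patience=250, max_epochs=1500):
    """Train until the score stalls for `patience` epochs or `max_epochs` is reached.

    run_epoch(epoch) -> score of the model after that epoch (higher is better).
    Returns the best score and the number of epochs actually run.
    """
    best_score = float("-inf")
    best_epoch = 0
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        score = run_epoch(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement over the last `patience` epochs
    return best_score, epoch
```

For example, a score that plateaus after epoch 10 stops the loop at epoch 260, while a score that keeps improving runs for the full 1 500 epochs.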

Given that evaluating benchmarks with a short running time (i.e., less than one second) is susceptible to noise, for all experiments we report both the numbers for all benchmarks and the numbers for non-trivial benchmarks. We define a non-trivial benchmark as one that takes at least 5 seconds to solve, or has at least 100 IG queries (depending on whether we are measuring running time or predictive power, respectively).

#### *B. Predictive power*

We evaluate the model M in two settings, namely, *online learning* and *transfer learning*. Given a lemma in the form of a list of literals, M predicts a likely inductively generalized lemma, which is a sub-list of the given lemma. We define a prediction returned by M as a *perfect prediction* iff given the same input, vanilla SPACER produces the same exact answer. Note that this is a conservative criterion because there might be multiple valid inductive generalizations.

*Online learning* In this setting, we collect 144 benchmarks from KIND2 that have at least 2 IG queries in their solving trace. For each of them, we use SPACER to solve it until completion or until a time limit of 930 seconds is reached. Each solving trace is then split in half, and M is trained on the first half to predict the answers to queries seen in the second half of the trace (tail queries). We measure what percentage of the tail queries is perfectly predicted by M. The average length of queries is 9.75 literals.
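The online-learning protocol and the metric can be summarized in a small Python sketch. The trace format (a list of pairs of an input lemma and SPACER's generalized lemma, each a list of literals), the callable `model`, and the function names are assumptions made for illustration.

```python
def split_trace(trace):
    """Split a solving trace into a training half and the tail queries."""
    mid = len(trace) // 2
    return trace[:mid], trace[mid:]


def perfect_prediction_ratio(model, tail_queries):
    """Percentage of tail queries whose prediction equals SPACER's answer exactly.

    model(lemma) -> predicted sub-list of the lemma.
    tail_queries: list of (lemma, spacer_answer) pairs.
    """
    if not tail_queries:
        return 0.0
    hits = sum(1 for lemma, answer in tail_queries if model(lemma) == answer)
    return 100.0 * hits / len(tail_queries)


# Hypothetical trace and a trivial "model" that always keeps the first literal.
trace = [(["a", "b"], ["a"]), (["c", "d"], ["c"]), (["e", "f"], ["e"]), (["g", "h"], ["h"])]
train_half, tail = split_trace(trace)
ratio = perfect_prediction_ratio(lambda lemma: lemma[:1], tail)  # 50.0
```

Exact equality with SPACER's answer makes the metric conservative: a different but still valid generalization counts as a miss.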

M achieves a 60.19% perfect prediction ratio for all benchmarks and 72.18% for non-trivial benchmarks. The trend of the perfect prediction ratio along with the corresponding number of queries is plotted in Fig. 7a, where the Y-axis is the perfect prediction ratio and the X-axis is benchmarks ordered according to their total number of IG queries. The plot shows that M generally works better for larger benchmarks. For instance, M returns perfect predictions for more than 90% of the queries in benchmarks with 1 600 or more IG queries.

*Transfer learning* In this setting, we use 123 benchmarks (i.e., 30 seed benchmarks and 93 variant benchmarks) from KIND2 based on their naming convention. For example, metros\_2\_e1\_1116.smt2 is one variant of metros\_2.smt2. Note that we have fewer benchmarks in this task since some seed benchmarks can be solved without any IG queries, while their variants cannot. Those seeds and variants are all excluded from the task. The average length of the queries for this task is 8.43 literals.

We train M on traces generated by solving the seed benchmarks to completion or until timeout. The models are then used to predict queries asked during the solving process of the corresponding variants.

M achieves 68.36% and 76.89% perfect prediction ratio for all benchmarks and non-trivial benchmarks, respectively. We also plot the trend of perfect prediction ratio in Fig. 7b.

<sup>3</sup>https://github.com/kind2-mc/kind2-benchmarks.

<sup>4</sup>These dimensions could be further fine-tuned, which we leave as interesting future work.

Fig. 8: ROPEY's speedups for benchmarks taking more than s seconds to solve.


TABLE II: ROPEY's speedups compared with SPACER.

Similar to the online learning setting, M generally works better for larger benchmarks. It is a little surprising that the perfect prediction ratio of transfer learning setting is slightly better than the ratio of online learning. This might indicate that in our benchmarks, queries in the beginning and at the end of the same benchmark are more different than queries between seeds and variants. Quantifying this observation is an interesting direction for future work.

#### *C. Running time*

ROPEY's running time can be broken down into a few components: SPACER's time (in which IG time is a subcomponent), communication time over gRPC, data parsing time, and model running time. We group the latter three components into *inferencing time*. On average, inferencing takes 48.1% and 24% of the total running time for all and non-trivial benchmarks, respectively. There are engineering opportunities to reduce the inferencing time, which we leave for future work.

We measure the speedup in IG time and SPACER's solving time with and without the inferencing time. If ROPEY times out, we measure the running time that ROPEY needs to verify to the same depth as SPACER. The timeout is set to be 930 seconds, and in cases where ROPEY times out, we rerun it with the timeout set to 2 790 seconds to allow it to verify to the same depth as SPACER. The results are in Table II. We also plot in Fig. 8 the speedups achieved at different running time thresholds s, e.g., for benchmarks that take more than 50 seconds to solve, 100 seconds to solve, etc.

For unsolved benchmarks, notice the spikes at the tail of Fig. 8: ROPEY takes much less time to reach the same depth as SPACER, up to 2.8× faster (inferencing time included).

#### *D. Training time*

In this paper, we specifically consider realistic applications where training time is not a bottleneck: train once on one instance and apply to many similar instances (offline), or train during a very long run (days or weeks) and apply to the rest of the run (online). For that reason, we do not optimize training code, nor do we run training in an isolated environment where time measurements are meaningful. Nonetheless, we share some statistics of the training time: the minimum, median and maximum training times are 17, 1027 (17 minutes), and 165811 seconds (46 hours), respectively. More details are hosted on our project webpage https://nhamlv-55.github.io/Ropey/training time. Training any individual model (i.e., when the GPU is used to train only a single model) is faster, but training all models sequentially is too slow. Since we do not consider training time itself to be of significant interest, we train as many models in parallel as possible.

Fig. 9: Effects of using different embeddings for benchmarks with at least k IG queries.

#### *E. Ablation study*

Embedding variables and constants is crucial for our tasks. In this ablation study, we evaluate the three embeddings we proposed in Sec. IV-B for handling variables and constants. Fig. 9 shows four plots of ROPEY with four different embedding configurations. ROPEY achieves the best performance when all embeddings are enabled. ROPEY's performance drops dramatically when the positional embedding is disabled, indicating that leveraging a variable's position information helps capture co-occurrence patterns. Disabling Naive Embedding or Constant Embedding does not affect the performance much for benchmarks with a relatively small number (i.e., < 1 000) of IG queries; however, the performance drops dramatically when the number of IG queries becomes large.

#### VII. RELATED WORK

There have been a number of works studying neural learning for symbolic reasoning. Some studied the capability of deep learning models to handle relatively simple symbolic reasoning tasks, such as symbolic expression equivalence [1] or logical entailment [19], which can be easily performed by a symbolic engine like an SMT solver. [2] and [37] focus on learning embeddings of programs using paths over abstract syntax trees or control flows, and the learned embeddings are helpful for suggesting function or variable names. Our focus is on improving state-of-the-art symbolic engines on non-trivial symbolic reasoning tasks like symbolic model checking. The most relevant work is [4], which predicts a high-level strategy (or configuration) of an SMT solver based on *static* statistics of a verification instance. In contrast, our approach learns from *dynamic* runs and provides guidance for decisions at a finer granularity. Two other related works are [24] and [42]. The former also uses deep learning to guide numerical analysis, where soundness is not a concern as imperfect prediction results in less precise (but still acceptable) numerical approximations. Like our problem, the latter also faces the soundness issue and proposes an end-to-end reinforcement learning based approach, which however suffers from scalability issues.

#### VIII. CONCLUSION

In this paper, we explore how deep neural networks can be used in IC3. We chose inductive generalization because it (a) is a common bottleneck and (b) seemed suitable to optimize with NNs. We view this as a first step in using data-driven NNs to guide IC3. Specifically, we propose a data-driven approach to improving inductive generalization, which effectively embeds symbolic formulas in fixed-length vectors and uses a hierarchical recurrent neural network to guide inductive generalization (i.e., predict which literals of a lemma should be kept or dropped). We build a prototype, ROPEY, and evaluate it on the KIND2 benchmark suite. We observe promising predictive power of neural networks in inductive generalization and a modest improvement in absolute running time over the state-of-the-art SMC engine, SPACER: ROPEY improves the solving time for non-trivial instances by 25%.

Our work shows that it is possible for NNs to learn complex symbolic patterns in IC3, and that such learned patterns can be used to improve IC3. ROPEY's raw performance does not show a strong gain yet, but is still encouraging. We envision that the performance gain could be much more significant with better engineering effort or by leveraging advanced hardware acceleration for deep learning models (such as TPUs) in the future. Another orthogonal improvement is to explore more advanced transformer-based language models like GPT-3 [9] to further improve the prediction accuracy.

#### REFERENCES


# Model Checking AUTOSAR Components with CBMC

Timothee Durand<sup>∗</sup>, Katalin Fazekas<sup>†</sup>, Georg Weissenbacher<sup>†</sup> and Jakob Zwirchmayr<sup>∗</sup>

<sup>∗</sup>TTTech Auto AG, Vienna, Austria <sup>†</sup>TU Wien, Vienna, Austria

*Abstract*—Automotive software needs to comply with stringent functional safety standards to reduce the risk of malfunction. In particular, the ISO 26262 standard highly recommends the use of formal verification for highly safety-critical software components. Automated formal verification techniques (such as Model Checking) enable the quick detection of intricate software bugs and can, to a limited extent, even guarantee their absence.

We report our efforts to deploy the openly available verification tool CBMC to verify AUTOSAR Software Components and Complex Device Drivers using Bounded Model Checking and k-induction combined with upfront static analysis.

#### I. INTRODUCTION

Modern cars now contain as many as 150 Electronic Control Units (ECUs) running software from different suppliers. AUTOSAR, an open and standardized software architecture for automotive applications, guarantees the interoperability of automotive software components. This platform provides a common development methodology based on a standardized exchange format for describing software components (ARXML), standardized communication interfaces and a Run-Time Environment (RTE), and a basic software (BSW) layer (see Fig. 1). The BSW comprises hardware-specific software modules (including Complex Device Drivers (CDDs)) that provide functions to the upper software layers. The RTE middleware provides interfaces and functions for inter- and intra-ECU communication between the application software components. Software Components (SWCs) in the application layer access the lower layers via the RTE, and can hence be readily deployed on different vehicle and platform variants.

The ISO 26262 [1] functional safety standard establishes safety requirements for automotive components (including software). The norm defines four Automotive Safety Integrity Levels (ASILs) ranging from A (low risk) to D (life-threatening hazards). ASIL-D requires the highest degree of rigor, including (semi-)formal verification in the development process. Consequently, formal methods are frequently applied in industrial dependable system design [2]. Moreover, ASIL code needs to be reverified whenever the implementation is changed, re-generated, or re-configured.

In this context, automated static analysis techniques (such as abstract interpretation or software model checking [3], [4]) are particularly attractive, as they require comparatively little manual interaction and can detect intricate software bugs and, to a limited extent, even guarantee their absence.

We investigate the applicability of model checking to AUTOSAR code written in ANSI-C. While commercial tools for static analysis of AUTOSAR code exist [5], we focus on the software model checking tool CBMC [6] because of the tool's availability, sustained development, and its permissive open source license. The latter allowed us to adapt CBMC to our workflow and requirements: the specifics of AUTOSAR software and the ISO 26262 requirements (such as the ARXML description, the use of the RTE, and repeated verification runs) impose the need for an automated tool chain.

Fig. 1. AUTOSAR Architecture

Contributions. Our report (based on the master's thesis of the first author [7]) describes the following contributions:


#### II. METHODOLOGY

To verify our SWCs and CDDs (described in subsect. III-A), we need to (1) generate the verification environment and (2) instrument and augment the code with static analysis results.

#### *A. The AUTOSAR Platform*

AUTOSAR uses three abstraction levels to describe the SWCs of a system. The highest level, the Virtual Function Bus (VFB), describes types of SWCs and their connections to other SWCs (PortInterfaces and PortPrototypes), as well as the messages they exchange via their ports (DataTypes). At the middle level, the RTE, the execution behavior of SWCs, i.e., RunnableEntities and their trigger events, is defined. Finally, at the implementation level, these defined RunnableEntities are mapped to their implementations (given as source or object code).

Fig. 2. Entry points for k-Induction experiments to prove property P

System constraints and the system configuration are described in the ARXML format (see Fig. 3 for an example). In the given context, the SWC Description and the RTE Extract of the ECU Configuration are of relevance, since they describe the messages and data-types that SWCs can exchange.

#### *B. Generating Verification Environment*

The RunnableEntities of an SWC (defined in the corresponding ARXML model [8]) provide initialization and step functions, which are invoked periodically in an order we presume to be fixed (see also sect. V).

BMC focuses on checking the correctness of the program only up to a predetermined number of iterations of each loop, pruning all executions that require more. The entry point of our generated test harness for BMC is a function which, after initialization, calls the step functions of the RunnableEntities in an (unbounded) loop.

The test harness for k-Induction<sup>1</sup> has two entry points: one for the base case and another for the inductive step. Fig. 2 illustrates the principle of k-Induction: BMC is used to establish the base case by checking whether the assertion P holds for the first K loop iterations. Subsequently, we use BMC to check whether P holds after K + 1 steps under the assumption that it holds in the first K iterations starting from an *arbitrary* program state. If both the base case and induction step succeed, then P holds after any number of loop iterations.
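The two checks can be illustrated on a tiny explicit-state system in Python. This sketch enumerates states directly instead of calling a bounded model checker, and the transition system, function name, and parameters are made up for illustration; it is not the generated C harness.

```python
def k_induction(states, init_states, successors, prop, k):
    """Return True if `prop` is proved by k-induction on a finite transition system.

    states: all states, including unreachable ones.
    successors(s) -> iterable of successor states of s.
    """
    # Base case: prop holds in the first k states of every run from an initial state.
    frontier = set(init_states)
    for _ in range(k):
        if not all(prop(s) for s in frontier):
            return False
        frontier = {t for s in frontier for t in successors(s)}

    # Inductive step: start from an arbitrary state and follow paths on which
    # prop holds for k consecutive states; prop must then hold in the next state.
    frontier = {s for s in states if prop(s)}
    for _ in range(k - 1):
        frontier = {t for s in frontier for t in successors(s) if prop(t)}
    return all(prop(t) for s in frontier for t in successors(s))


# Toy system: 0 -> 1 -> 2 -> 0 is reachable from 0; 4 -> 5 -> 5 is unreachable.
edges = {0: [1], 1: [2], 2: [0], 4: [5], 5: [5]}
safe = lambda s: s != 5
print(k_induction(edges.keys(), [0], edges.get, safe, 1))  # False
print(k_induction(edges.keys(), [0], edges.get, safe, 2))  # True
```

With K = 1 the inductive step fails because the unreachable state 4 satisfies the property but steps to 5. With K = 2 no path of two safe states ends in 4, so the proof succeeds. This is the sense in which the inductive step starts from an arbitrary state.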

SWCs exclusively interact with each other and with the BSW through the RTE (see Fig. 1), and RTE ports are their only external input [9]. We assume the correctness of the RTE implementation and replace it with an appropriate abstraction. This has two consequences: Firstly, it results in a smaller code base that is more tractable for verification tools. Secondly, as our RTE abstraction conservatively models the most general environment of the SWC, it takes arbitrary interactions with the environment (e.g., any communication via the RTE) into account. This modular approach guarantees that a change in the environment (e.g., the deployment of other components) does not invalidate prior verification results.

<sup>1</sup>CBMC's built-in support for k-Induction did not cope with the nested loops in our SWCs, which is why we require a separate harness.

```
  1 <IMPLEMENTATION-DATA-TYPE UUID="...">
  2   <SHORT-NAME>Dt_Engine_RPM</SHORT-NAME>
  3   ...
  4   <COMPU-METHOD-REF DEST="COMPU-METHOD">
  5     /DataTypes/CompuMethods/CM_Engine_RPM
  6   </COMPU-METHOD-REF>
  7   <IMPLEMENTATION-DATA-TYPE-REF DEST="...">
  8     /AUTOSAR_Platform/ImplementationDataTypes/uint16
  9   </IMPLEMENTATION-DATA-TYPE-REF>
 10   ...
 11 </IMPLEMENTATION-DATA-TYPE>
 12 ...
 13 <COMPU-METHOD UUID="...">
 14   <SHORT-NAME>CM_Engine_RPM</SHORT-NAME>
 15   ...
 16   <COMPU-SCALE>
 17     <LOWER-LIMIT INTERVAL-TYPE="CLOSED">0</LOWER-LIMIT>
 18     <UPPER-LIMIT INTERVAL-TYPE="CLOSED">255
 19     </UPPER-LIMIT>
 20     <COMPU-RATIONAL-COEFFS>...</COMPU-RATIONAL-COEFFS>
 21   </COMPU-SCALE>
 22   ...
 23 </COMPU-METHOD>

    i void modif_nondet_Dt_Engine_RPM(Dt_Engine_RPM* tmp);
   ii void modif_nondet_uint16(uint16* tmp);
  iii Std_RetType get_nondet_Std_ReturnType();
   iv Std_RetType
    v Rte_Read_Engine_RPM_stub(Dt_Engine_RPM* tmp);
   vi
  vii void modif_nondet_Dt_Engine_RPM(Dt_Engine_RPM* tmp) {
 viii   modif_nondet_uint16(tmp);
   ix   assume(0 <= *tmp && *tmp <= 255);
    x }
   xi
  xii Std_RetType
 xiii Rte_Read_Engine_RPM_stub(Dt_Engine_RPM* tmp){
  xiv   modif_nondet_Dt_Engine_RPM(tmp);
   xv   return get_nondet_Std_ReturnType();
  xvi }
```
Fig. 3. Parts of ARXML specification of data type Dt\_Engine\_RPM (above) and an example of using it in generated RTE function stubs (below)

The ARXML specification [10] and the AUTOSAR meta model [8] describe the DataTypes of messages, allowing us to automatically generate an abstraction of the RTE communication functions. Fig. 3 depicts parts of a specification in the ARXML format that defines data types on different abstraction levels. Lines 7-9 state that Dt\_Engine\_RPM is implemented as uint16. Lines 4-6 refer to a CompuMethod element that specifies a range of valid values from 0 to 255 for the data type. These limits guarantee that the computation will result in a value representable by uint16. For a thorough definition of data types and their constraints see [8, Sect. 5].

In our RTE abstraction, parameters and return values of RTE functions are first havocked and then constrained based on information provided in the ARXML specification. These constraints are automatically generated. We generate nondeterministic modifier and generator functions that are invoked in the generated RTE API stubs (see, e.g., function Rte\_Read\_Engine\_RPM\_stub in Fig. 3). Fig. 3 also illustrates how the data constraints defined by the XML in lines 17-18 translate into a C assumption (line ix) due to the type Dt\_Engine\_RPM.
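The translation from range limits to a C assumption can be sketched with the Python standard library. This is a simplified illustration, not the authors' generator: it parses a namespace-free COMPU-SCALE fragment with `xml.etree.ElementTree`, whereas real ARXML files use XML namespaces and a deeper element hierarchy. The function names are hypothetical, and the emitted C text mirrors the naming scheme visible in Fig. 3.

```python
import xml.etree.ElementTree as ET


def read_limits(compu_scale_xml):
    """Extract the integer (lower, upper) limits from a COMPU-SCALE fragment."""
    scale = ET.fromstring(compu_scale_xml)
    lower = int(scale.find("LOWER-LIMIT").text.strip())
    upper = int(scale.find("UPPER-LIMIT").text.strip())
    return lower, upper


def generate_modifier_stub(type_name, base_type, lower, upper):
    """Emit C text for a modifier that havocs a value, then constrains its range."""
    return (
        f"void modif_nondet_{type_name}({type_name}* tmp) {{\n"
        f"  modif_nondet_{base_type}(tmp);\n"
        f"  assume({lower} <= *tmp && *tmp <= {upper});\n"
        f"}}\n"
    )


fragment = """
<COMPU-SCALE>
  <LOWER-LIMIT INTERVAL-TYPE="CLOSED">0</LOWER-LIMIT>
  <UPPER-LIMIT INTERVAL-TYPE="CLOSED">255
  </UPPER-LIMIT>
</COMPU-SCALE>
"""
lower, upper = read_limits(fragment)
stub = generate_modifier_stub("Dt_Engine_RPM", "uint16", lower, upper)
print(stub)
```

Keeping the havoc call and the assumption in one generated function means every stub that produces a value of this type inherits the range constraint automatically.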

#### *C. Static Analysis and Instrumentation of Code*

As a next step, the verification target SWC source code, its dependencies and the generated RTE stubs are built and linked into a single object with CBMC. Though our software project is complex and uses many architectural parameters, CBMC's goto-cc could seamlessly replace the compiler and linker in our build process. We note that, in accordance with the ISO 26262 standard, our code base is written in a well-specified and supported subset of the ANSI-C language.

Before starting the verification with CBMC, we perform an upfront static analysis of the code to support and complement the strengths of CBMC. To this end, we emit the complete target project into a single source file and run Frama-C [11] on the resulting code. While Frama-C provides a wide range of static analysis techniques, we only employed its Evolved Value Analysis (EVA [12]) plug-in, which is based on abstract interpretation techniques. We used its default parameters that do not rely on more advanced abstract domains. This analysis can infer relatively small value sets for the variables (including function pointers), which simplifies the task of CBMC, but also provides indispensable type constraints for constructing induction proofs in some of our k-Induction experiments. The results of the static analysis are automatically incorporated as assumptions constraining the values of global variables (which represent the entire state of the system) and as replacements of function pointers with explicit case statements.

Prior to instrumentation of the code with the constraints provided by Frama-C, we verify (in independent k-Induction runs) that the value sets provided by Frama-C are actually inductive invariants. To verify the results of the function pointer analysis, the bodies of functions that are unreachable according to Frama-C are replaced with failing assertions which are then checked using CBMC.

#### *D. Implementation details*

To automatically parse the ARXML specifications and RTE headers and to generate C stubs, we relied on several openly available Python modules (e.g., PyCParser [13], lxml [14], and cogu-autosar [15]). Some missing POSIX stubs were implemented manually, and we had to patch CBMC to emit proper C code for the SWCs in our experiments.

#### III. CASE STUDIES

#### *A. Component Descriptions*

We analyse four AUTOSAR SWCs of an automotive software platform that comprises ECUs with multiple hosts. The platform provides services such as a common time-base for the hosts, global time-triggered scheduling, and time-triggered or time-sensitive communication between hosts. A custom RTE hides the fact that the underlying system is distributed and hosted on multiple SoCs/CPUs from the Application SWCs.

*LifeCycle Service Server (LCS-S) component:* This component is typically executed on the host with the highest ASIL and implements a state machine that determines the state (Init, Standby, Running, etc.) of each host. Running, for instance, indicates that the platform started up successfully and all hosts are operating under supervision. State transitions are triggered by failing built-in self tests, or depend on the states of other services. The LCS-S sends requests to its clients to trigger transitions and ensures that all client hosts transition correctly and report the expected lifecycle states.

While the LCS-S communicates with other SWCs via the RTE, it is considered a CDD because it directly interacts with other health- and safety-related platform services implemented as CDDs. These interactions via non-standardized interfaces require a few LCS-specific extensions of the verification environment and hence knowledge about implementation details.

*LifeCycle Service Client (LCS-C) component:* This component implements the same state machine as the LCS-S and periodically checks whether state transitions are required or have been requested by the LCS-S. An example of a transition requested by the LCS-S and confirmed by the LCS-C is the power-off sequence, where clients might store data in non-volatile memory.

*Vehicle Communication Service (ApCom) component:* This Application SWC is typically either ASIL-B or D and receives messages from the CAN bus (via the corresponding service in the BSW) and transforms them into RTE data types. Thus, the developers need not be aware of the underlying CAN specifics.

As ApCom utilizes only RTE and BSW COM interfaces, it can be model checked with a generic abstraction of these interfaces. Since large parts of the configuration and the implementation are generated based on a mapping between the CAN and RTE messages, the repeated (automated) verification of this generated code is frequently necessary.

*Middleware:* This component is a CDD that communicates with other hosts through a Transport Layer (e.g., Ethernet or a time-sensitive version thereof), often relying on OS system calls. Since the exchanged messages contain RTE data, it requires non-standardized interaction with the RTE (such as access to its buffer management system), which complicates verification. While the implementation of the buffer management is static, generated or configurable parts of the code introduce the need for repeated analysis. Since it handles ASIL data, the Middleware may be classified up to ASIL-D.

Table I presents some code metrics for each SWC to illustrate their complexity. More details are available in [7, Section 5]. The components of the LifeCycle service are simpler than the other SWCs, with the LCS-S being the more complex one of both due to supervision and platform initialization tasks. The ApCom component relies heavily on calls-by-reference and function pointers, as evidenced by the amount of pointer arithmetic and dereference operations. Its buffer and data frame manipulation operations make the Middleware the most challenging component of our case study. The high complexity metrics for ApCom and Middleware also denote the presence of large chunks of generated code with repetitive structures within these components.

#### *B. Checked program properties*

Our goal is to automatically detect potential errors and vulnerabilities (expressed as assertions) in our code base. In addition to assertions added by developers, we check the


TABLE I CODE METRICS OF TARGET SOFTWARE COMPONENTS

TABLE II RUNNING TIMES FOR STATIC ANALYSIS OF THE TARGET SWCS


properties automatically generated by CBMC (e.g. possible arithmetic overflows, safety of pointer dereferences; see [6]). To enable k-Induction, we instrumented our code base with the necessary assumptions and assertions similarly to Fig. 2. In the k-Induction experiments, we additionally checked constraints on permissible values of variables (e.g., to identify invalid states in the LifeCycle service). Note that defining these latter properties is a manual step that requires insight into the implementation details and an in-depth understanding of the application domain, while the other assertions are constructed automatically.

#### *C. Experiments and Results*

For verification we used CBMC 5.23. All experiments were conducted on an Intel(R) Xeon(R) CPU E5345@2.33GHz equipped with 47.2 GB of memory, running Ubuntu 18.04.4. For each run, we set a memory limit of 40 GB and a CPU time limit of one hour, measured by the tool BenchExec [16].

*1) Static Analysis:* We introduced static analysis into our workflow to address three challenges. First, it avoids spurious counterexamples caused by imprecise value analysis (see, for example, our k-Induction experiments later in this section). Second, in some of our benchmarks, imprecise value analysis of function pointers introduced cycles in the call graph that led to non-termination of CBMC. Finally, the computed call graph allows us to identify and exclude code that is not part of the targeted code base but is still included in the compilation process. The difference in size (lines of code) before and after slicing unreachable functions in the input file is given in Table II. Hence, in our experiments static analysis is an essential preprocessing step that provides valuable benefits.

To gain these benefits, however, an exhaustive static analysis of the code base for each SWC is necessary. Table II presents the running time and memory requirements of this step for each SWC. Note that this analysis includes a precise value analysis for every global variable and function pointer of the code base and removes the unreachable sections of the SWCs.

*2) Bounded Model Checking:* We considered 5 iterations of the loop calling the RunnableEntities of our SWCs (cf. subsect. II-B). As most loops in automotive real-time software are statically bounded, CBMC was able to automatically determine bounds for most other loops. In addition, CBMC can detect whether there exist executions that iterate a loop more often than predetermined by the given bound, which we used to identify loops that needed to be bounded manually (of which there were fewer than 10 overall).

Table III (left) summarizes our BMC results, providing for each SWC the number of checked assertions, memory usage, and run-time. Though no real bugs were found, our verification attempts revealed a modelling flaw in the ARXML specification of the ApCom SWC. In our first verification attempt, CBMC reported an arithmetic overflow in ApCom. Analyzing the report showed that the ARXML specification of the data type of one of the involved variables (whose value was provided by our ARXML-derived RTE abstraction) was too permissive. As the actual implementation of the RTE is more restrictive, this overflow cannot occur in practice.

We identified a similar problem with the ARXML-derived RTE model of the LCS-C component, which yielded a Not Present state that is unreachable in the actual implementation. This revealed a limitation of our modular verification approach, which lacks precise information about the states reachable in other (abstracted) components. As before, this bug cannot occur in the implementation.

The Middleware turned out to be too challenging to verify in our experiments. Attempts to simplify the program (e.g. by abstracting away the initialization of shared memory regions, which introduced large arrays in the resulting formulas) led to numerous spurious error reports, rendering the approach impractical. Since CBMC did not support some necessary operations, our attempts to deploy a Satisfiability-Modulo-Theory (SMT) solver as back-end also failed.

*3)* k*-Induction:* The right part of Table III presents the results of our k-Induction experiments. The run-times are the sum and the memory requirements are the maximum of the two consecutive CBMC runs for the base case and induction step (see Fig. 2). In our experiments, we observed that k = 1 is sufficient in all our (terminating) runs to prove the properties, which we attribute to the auxiliary constraints provided by the upfront static analysis. Hence, k-Induction uses fewer resources than BMC in our setting.

Moreover, the value constraints provided by Frama-C proved to be crucial. Our verification attempts without static analysis led to spurious reports of out-of-bound array accesses in the LCS-S component. This is owed to the fact that the initial states (of the state machine) in the induction step (Fig. 2) are arbitrary and hence potentially unreachable in

TABLE III EXPERIMENTAL RESULTS OF BOUNDED MODEL CHECKING AND k-INDUCTION


the actual implementation. The value set information provided by Frama-C constrains the initial states to reachable states and strengthens our induction hypothesis. Other components (LCS-C and ApCom) could be verified even without the use of Frama-C. As in our BMC experiments, our attempts to verify the Middleware timed out.
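The effect of the value constraints on the induction step can be illustrated with a minimal explicit-state sketch. It is not the CBMC/Frama-C tool chain used in the experiments: the toy state machine, the names `step`, `prop`, and `one_induction`, and the 3-bit variable domain are all assumptions made for illustration. The property (an index stays within a 4-entry table) fails the unconstrained induction step because the step starts from arbitrary, possibly unreachable states, and it passes once an externally established range for `state` is assumed.

```python
from itertools import product

DOMAIN = range(8)  # hypothetical: every variable is a 3-bit value


def init(state, idx):
    """Initial states of the toy state machine."""
    return state == 0 and idx == 0


def step(state, idx):
    """One loop iteration: idx takes the old state, state cycles through 0..3."""
    next_state = (state + 1) % 4 if state < 4 else state
    return next_state, state


def prop(state, idx):
    """Safety property: idx is a valid index into a 4-entry table."""
    return idx < 4


def one_induction(assumption=lambda state, idx: True):
    """1-induction by enumeration; returns (proved, counterexample state)."""
    # Base case: every initial state satisfies the property.
    for s in product(DOMAIN, repeat=2):
        if init(*s) and not prop(*s):
            return False, s
    # Induction step: start from an ARBITRARY state that satisfies the
    # property (and the optional value constraints) and take one step.
    for s in product(DOMAIN, repeat=2):
        if assumption(*s) and prop(*s) and not prop(*step(*s)):
            return False, s
    return True, None


# Without value constraints the step fails from the unreachable state 4.
plain = one_induction()
# Assuming state < 4 (standing in for a value range computed by a sound
# static analysis) removes the spurious counterexample.
strengthened = one_induction(lambda state, idx: state < 4)
```

The assumption is only sound if the range is established independently, which is the role the static analysis plays in the text above.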

For a comparison of (an older version of) CBMC to alternative software model checking tools (such as CPAChecker [17] and Ultimate Automizer [18]) on the presented SWCs, see [7] (Section 6, pages 44-45).

# IV. RELATED WORK

Ahmed and Safar [19] use the symbolic simulation tool KLEE [20] to automatically extract test cases from the C source code of an AUTOSAR BSW module. As testing of safety-critical applications must be requirements-based [1], generated test cases need to be mapped to requirements. In their CBMC-based automated testing method for the avionic domain, Sun et al. [21] annotate the source code with low-level requirements (expressed as pre- and post-conditions) to establish such a mapping. Mittag [22] applies static analysis to AUTOSAR components, focusing on comparatively simple properties. Berger et al. [23] apply the CBMC-based verifier BTC [24] to check automotive code generated by Simulink, but do not address AUTOSAR. Fang et al. [25] use the SPIN model checker to verify a hand-crafted model of an AUTOSAR-based operating system. Westhofen [26] implements custom k-Induction on top of CBMC to efficiently verify automotive C code.

# V. DISCUSSION AND CONCLUSION

Automation was a primary goal, as it enables automated regression verification and limits the effort for the verification engineer. The CBMC model checker and its mature ANSI-C support allowed us to use our existing build system and a largely unmodified code base. The ARXML component descriptions and the layered architecture of AUTOSAR made it possible to delimit the SWCs and automate the generation of a test harness and stubs that abstract the behaviour of the RTE.

We did, however, face challenges regarding automation, modeling the environment, and scalability. Unlike SWCs, CDDs are not standardized by AUTOSAR. They may use interfaces that are not available to standardized SWCs (e.g., to directly access peripherals). Consequently, the stubs for non-standardized interfaces specific to a CDD need to be generated manually. Moreover, even for SWCs, an overly abstract model of the RTE may lead to false positives. This can be addressed by providing a more precise model of the RTE (requiring substantial insight into the details of the RTE) or by including actual RTE code. The latter approach, however, amounts to verifying the SWC in the *absence* of an environment.

As CBMC provides limited support for static analysis, we combined it with an upfront run of Frama-C in order to reduce the computational effort for the model checking – interfacing the tools required a non-trivial implementation effort.

Preliminary experiments showed that verifying multiple, interacting components reduces spurious bug reports. This, however, would require taking into account all execution schedules of the runnables, which we consider future work. Another direction for future work is to reuse our verification efforts for the presented SWCs whenever a repeated analysis is necessary (i.e. when the implementation is changed or reconfigured) by considering incremental verification techniques.

Overall, our conclusion and outlook are positive: despite all challenges and the engineering effort required to deploy CBMC to verify AUTOSAR components, we ultimately succeeded in checking non-trivial and realistic SWCs.

# ACKNOWLEDGMENTS

This work was partially funded by the Vienna Science and Technology Fund (WWTF) under grant NXT19-006. The authors thank the anonymous reviewers for their valuable feedback and suggestions.

#### REFERENCES


in the automotive domain," in *Symposium on Formal Methods (FM)*, ser. LNCS, vol. 10951. Springer, 2018.


# Automating System Configuration

Nestan Tsiskaridze, Maxwell Strange, Makai Mann, Kavya Sreedhar, Qiaoyi Liu, Mark Horowitz, Clark Barrett

Stanford University, Stanford, CA 94305, USA

E-mail: {nestan, mstrange, makaim, skavya, joeyliu}@stanford.edu, horowitz@ee.stanford.edu, barrett@cs.stanford.edu

*Abstract*—The increasing complexity of modern configurable systems makes it critical to improve the level of automation in the process of system configuration. Such automation can also improve the agility of the development cycle, allowing for rapid and automated integration of decoupled workflows. In this paper, we present a new framework for automated configuration of systems representable as state machines. The framework leverages model checking and satisfiability modulo theories (SMT) and can be applied to any application domain representable using SMT formulas. Our approach can also be applied modularly, improving its scalability. Furthermore, we show how optimization can be used to produce configurations that are best according to some metric and also more likely to be understandable to humans. We showcase this framework and its flexibility by using it to configure a CGRA memory tile for various image processing applications.

#### I. INTRODUCTION

In systems engineering, the *system configuration* problem arises when systems are parameterized to increase their flexibility or functionality. It refers to the problem of choosing the appropriate parameter values for the context or application in which the system will be used. Most hardware and software systems, including hardware IPs, operating systems, networks, servers, and data centers, require some degree of configuration. The need for configuration also often arises when integrating decoupled parts of a system, including integrating software and hardware.

The difficulty of the system configuration problem has been gradually growing as systems increase in scale and complexity. In particular, in an effort to make designs more widely applicable and re-usable, there has been an increasing use of hardware that is configurable, not only at design time or setup time, but even during normal operation. Manual configuration of such systems is error-prone and may even be impossible, depending on how frequently the systems need to be reconfigured.

Automation of the configuration problem can also be beneficial during the system design process. In particular, it obviates the need for new hand-coded configuration files every time some configurable component changes. Increased automation of such steps supports a move towards more agile design processes. Agile approaches typically require the ability to rapidly and (largely) automatically integrate changing parts of a system while continuously maintaining correct end-to-end functionality. Having design blocks that are flexibly configurable aids this effort, as does the ability to automate the configuration.

A potential disadvantage of automated configuration is that it could lead to an increase in the opacity of the overall system. Hand-written configurations can be documented and explained to allow for easier understandability and maintainability. Thus, an additional goal when automating configuration should be to produce results that are comprehensible to humans and that can be easily reviewed and maintained.

In this paper, we present a general framework for automated system configuration. It provides a flexible approach for solving the configuration problem for systems composed of software, hardware, or both. The systems are modeled using transition systems, where transition formulas can use the full expressive power of SMT-LIB [1], the language used by satisfiability modulo theories (SMT) [2] solvers. The framework provides a systematic approach to facilitate fully automated or automation-guided system configuration. It is well-suited for both stand-alone designs and for designs with multiple configurable parts. For the latter, it is especially useful during system integration and rapid development.

The main contributions of this paper are:


The remainder of the paper is organized as follows. Section II presents background and notation. Section III formalizes the configuration solving problem and introduces our framework, including some extensions and limitations. In Section IV, we show how optimization techniques can be integrated into the approach, both for the purpose of improving performance as well as for improving human readability, and we discuss a few additional extensions of the framework. In Section V we present a case study, giving the details of a specific system design and showing how our framework can be applied. Experimental results for this case study are then reported in Section VI. We survey the related work in Section VII and conclude in Section VIII.

#### II. BACKGROUND

We assume the standard many-sorted first-order logic setting with the usual notions of signature, term, formula, and interpretation. A theory is a pair T = (Σ, I) where Σ is a signature and I is a class of Σ-interpretations, i.e., the models of T. A Σ-formula φ is satisfiable (resp., unsatisfiable) in T if it is satisfied by some (resp., no) interpretation in I. We define |=<sub>T</sub> over Σ-formulas: if φ and ψ are Σ-formulas, then φ |=<sub>T</sub> ψ if all interpretations which satisfy φ also satisfy ψ. In this case, we also call φ an abduct of ψ under T. For generality, we assume an arbitrary but fixed background theory T (which could be a combination of theories) with signature Σ and an infinite set X of variables. We will assume that all terms and formulas are Σ-terms and Σ-formulas whose free variables are in X, that entailment is entailment modulo T, and that interpretations are T-interpretations that assign every variable in X.

Given an interpretation I, a variable assignment s over a set of variables V is a mapping that assigns each variable v ∈ V of sort σ to an element of σ<sup>I</sup>, denoted v<sup>s</sup>. The assignment over V *induced* by an interpretation I (i.e., the assignment that maps each variable in V to its interpretation in I) is denoted I<sup>V</sup>. The assignment s restricted to the domain U ⊆ V is denoted by s<sup>U</sup>. We write I[s] for the interpretation that is equivalent to I except that each variable v ∈ V is mapped to v<sup>s</sup>. We write f ◦ g for functional composition, i.e., f ◦ g(x) = f(g(x)).

Satisfiability Modulo Theories (SMT). Satisfiability Modulo Theories [2] is an extension of the Boolean satisfiability (SAT) problem to satisfiability in first-order theories. SMT solvers combine the Boolean reasoning of a SAT solver with specialized theory solvers to check satisfiability of many-sorted first-order logic formulas. Some examples of commonly supported theories are: fixed-width bit-vectors, uninterpreted functions, linear arithmetic, and arrays. In our case study, we utilize fixed-width bit-vectors for modeling a hardware design.

#### Symbolic Transition Systems.

A symbolic transition system (STS) S is a tuple S := ⟨V, I, T⟩, where V is a finite set of state variables (possibly of different sorts), I(V) is a formula denoting the initial states of the system, and T(V, V′) is a formula expressing a transition relation, with V′ defined as follows. Let prime be a bijection that maps each variable v ∈ V to a new variable v′ (not in V) of the same sort. V′ is the codomain of prime.

A state s of S is a variable assignment over V. A sequence of states is called a *path*. An *execution* of S of length k is a pair ⟨I, π⟩, where I is an interpretation and π := s<sub>0</sub>, s<sub>1</sub>, …, s<sub>k−1</sub> is a *path* such that I[s<sub>0</sub>] |= I(V) and I[s<sub>i</sub>][s<sub>i+1</sub> ◦ prime<sup>−1</sup>] |= T(V, V′) for all 0 ≤ i < k − 1.

#### Unrolling and Bounded Model Checking.

An *unrolling* of length k of a symbolic transition system is a formula that captures an execution of length k by creating copies of the transition relation. This is accomplished by introducing fresh copies of every state variable for each state in the execution path. We use V@i to denote the set of variables obtained by replacing each variable v ∈ V with a new variable called v@i of the same sort. We refer to these as *timed* variables. Given an STS S, let

$$\operatorname{unroll}(\mathcal{S}, k) = I(V@0) \land \bigwedge_{0 \le i < k} T(V@i, V@(i+1)).$$

Bounded model checking (BMC) [3] is an unrolling-based symbolic model checking approach. Let P(V) be a formula representing a desired property of a symbolic transition system. BMC creates an unrolled transition system and adds an additional constraint that the property is violated at time k. The BMC formula at bound k is thus: unroll(S, k) ∧ ¬P(V@k). A typical approach for BMC starts with k = 0 and incrementally increases it if no counterexample is found at the current bound. A satisfiable BMC formula can easily be converted into an execution that violates the property.
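The incremental BMC loop can be sketched on an explicit finite-state system, where enumeration stands in for the SMT query on unroll(S, k) ∧ ¬P(V@k). This is a minimal illustration under stated assumptions: the function `bmc` and the toy counter (a 3-bit value that may increment or stutter) are hypothetical and not taken from the paper.

```python
def bmc(states, init, trans, prop, max_bound):
    """Search for a property violation after exactly k steps, k = 0, 1, ...

    Returns (k, path) for the smallest k at which some state reachable in
    exactly k transitions violates prop, or None if no violation exists up
    to max_bound. Enumeration replaces the solver call of symbolic BMC.
    """
    # frontier maps each state reachable in exactly k steps to one witness path
    frontier = {s: [s] for s in states if init(s)}
    for k in range(max_bound + 1):
        for s, path in frontier.items():
            if not prop(s):          # corresponds to "not P(V@k)"
                return k, path
        successors = {}
        for s, path in frontier.items():
            for t in states:
                if trans(s, t) and t not in successors:
                    successors[t] = path + [t]
        frontier = successors
    return None


# Hypothetical toy system: x starts at 0 and either stays or increments mod 8.
counter_states = range(8)
counter_init = lambda s: s == 0
counter_trans = lambda s, t: t in (s, (s + 1) % 8)
never_five = lambda s: s != 5

violation = bmc(counter_states, counter_init, counter_trans, never_five, 10)
no_violation = bmc(counter_states, counter_init, counter_trans, never_five, 4)
```

The returned path plays the role of the violating execution that a satisfying assignment of the BMC formula would yield.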

Optimization. An *optimization problem* OP is a tuple ⟨t, A, ≼, ϕ, O⟩ where:


I is a solution to OP if σ<sup>I</sup> = A, I |= ϕ, and for any I′ such that σ<sup>I′</sup> = A and I′ |= ϕ:

$$(\mathcal{O} = \operatorname{min} \to t^{\mathcal{I}} \preccurlyeq t^{\mathcal{I}'}) \land (\mathcal{O} = \operatorname{max} \to t^{\mathcal{I}'} \preccurlyeq t^{\mathcal{I}}).$$

A *multi-objective optimization problem* MOP is a finite sequence of optimization problems {OP<sub>1</sub>, …, OP<sub>n</sub>} over the same formula ϕ, where OP<sub>i</sub> := ⟨t<sub>i</sub>, A<sub>i</sub>, ≼<sub>i</sub>, ϕ, O<sub>i</sub>⟩ and t<sub>i</sub> is of sort σ<sub>i</sub> for i ∈ [1, n]. I is a solution to MOP if σ<sub>i</sub><sup>I</sup> = A<sub>i</sub>, I |= ϕ, and for any I′ such that σ<sub>i</sub><sup>I′</sup> = A<sub>i</sub> and I′ |= ϕ, either: (i) t<sub>i</sub><sup>I</sup> = t<sub>i</sub><sup>I′</sup> for all i ∈ [1, n]; or (ii) for some j ∈ [1, n], t<sub>i</sub><sup>I</sup> = t<sub>i</sub><sup>I′</sup> for all i ∈ [1, j), and

$$(\mathcal{O}_j = \operatorname{min} \to t_j^{\mathcal{I}} \prec_j t_j^{\mathcal{I}'}) \land (\mathcal{O}_j = \operatorname{max} \to t_j^{\mathcal{I}'} \prec_j t_j^{\mathcal{I}}),$$

where ≺ is the strict total order associated with ≼.
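The lexicographic reading of this definition can be illustrated over a finite candidate set. The sketch below is an assumption-laden illustration, not part of the framework: `solve_mop` is a hypothetical helper, candidates are enumerated rather than produced by a solver, and Python's integer order stands in for each ≼<sub>i</sub>.

```python
from itertools import product


def solve_mop(candidates, phi, objectives):
    """Return all candidates that are lexicographically optimal.

    objectives is a sequence of (term, direction) pairs with direction
    "min" or "max"; earlier objectives take priority over later ones.
    """
    feasible = [c for c in candidates if phi(c)]
    for term, direction in objectives:
        if not feasible:
            break
        pick = min if direction == "min" else max
        best = pick(term(c) for c in feasible)
        # Keep only candidates that are optimal for this objective; ties
        # are broken by the next objective in the sequence.
        feasible = [c for c in feasible if term(c) == best]
    return feasible


# Hypothetical instance: pairs (x, y) with x + y >= 3; first minimise the
# sum x + y, then maximise x among the remaining candidates.
pairs = list(product(range(4), repeat=2))
solutions = solve_mop(
    pairs,
    lambda c: c[0] + c[1] >= 3,
    [(lambda c: c[0] + c[1], "min"), (lambda c: c[0], "max")],
)
```

Every returned candidate agrees with the others on all objective values, which matches case (i) of the definition; any feasible candidate that was filtered out is strictly worse at the first objective where it differs, which matches case (ii).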

#### III. CONFIGURATION SOLVING FRAMEWORK

In this section, we formalize the configuration problem and introduce our automated framework for solving it. We also describe how to improve scalability using a modular approach.

#### *A. Problem Formalization*

Suppose we have a configurable system that we want to use in a particular application context. We assume the application context can precisely define an input/output relationship that it expects the system to adhere to. The *configuration finding problem* is then: given a system S and an application-supplied input-output relationship P for S, find a configuration C for S such that S satisfies P with configuration C. In this paper, we assume that P specifies behavior for only a finite number of steps. The rationale is that for many configurable systems, a segment of a desired execution is sufficient to partially (or fully) determine what the configuration should be. This is the case for the systems we target and for the case study we

Fig. 1: Formal system model.

describe later. More general specifications are an important direction for future work.

Formally, a configuration problem CP is a tuple ⟨S, k, V<sub>in</sub>, V<sub>out</sub>, V<sub>conf</sub>, P⟩ where:


A *configuration* C is defined as an assignment to the variables in V<sub>conf</sub>.

In this paper, we assume the configuration variables V<sub>conf</sub> remain unchanged once configured (a reasonable assumption for many systems, including the one in the case study we present in Section V). We enforce this by explicitly adding an additional *configuration constancy constraint*:

$$\operatorname{conf}(V_{\mathsf{conf}}, k) = \bigwedge_{0 \le i < k} V_{\mathsf{conf}}@(i+1) = V_{\mathsf{conf}}@i.$$

The configuration finding problem then reduces to checking the satisfiability of the *configuration formula*:

$$\begin{aligned} \phi(\mathcal{CP}) ={}& \operatorname{unroll}(\mathcal{S}, k) \land \operatorname{conf}(V_{\mathsf{conf}}, k) \land {} \\ & P(V_{\mathsf{in}}@0, \dots, V_{\mathsf{in}}@(k-1), V_{\mathsf{out}}@0, \dots, V_{\mathsf{out}}@k) \qquad (1) \end{aligned}$$

A configuration C is *correct* for CP if there exists an interpretation I such that I |= ϕ and C = I<sup>V<sub>conf</sub></sup>.

Fig. 2: Configuration solving framework (basic) scheme. CP is a configuration problem. ϕ is a configuration formula.

#### Example 1. *(simple ALU)*

*Let* S := ⟨{x : int, a : int, cfg : Bool}, x = 0, x′ = ite(cfg, x + a, x − a)⟩ *be a transition system in a configuration finding problem, where* V<sub>in</sub> = {a}*,* V<sub>out</sub> = {x}*,* V<sub>conf</sub> = {cfg}*, and* ite *is the if-then-else operator. There are two ways to configure* S*: as a system that always adds the current input to the current state, or as a system that always subtracts the current input from the current state. Let us consider two instances of an input-output relation for* k = 2*:*

*1)* P<sub>1</sub>(a@0, a@1, x@0, x@1, x@2) = a@0 = 1 ∧ a@1 = 1 ∧ x@0 = 0 ∧ x@1 = 1 ∧ x@2 = 2*. We are interested in whether there exists a value of* cfg *which satisfies both the configuration constancy constraint (i.e., remains unchanged) and* P<sub>1</sub>*. To determine this, we check the satisfiability of* unroll(S, 2) ∧ conf(cfg@0, cfg@1, cfg@2) ∧ P<sub>1</sub>(a@0, a@1, x@0, x@1, x@2)*, which expands to:*

$$\begin{aligned} x@0 &= 0 \land \\ x@1 &= ite(cfg@0, x@0 + a@0, x@0 - a@0) \land \\ x@2 &= ite(cfg@1, x@1 + a@1, x@1 - a@1) \land \\ cfg@1 &= cfg@0 \land cfg@2 = cfg@1 \land \\ a@0 &= 1 \land a@1 = 1 \land x@0 = 0 \land x@1 = 1 \land x@2 = 2 \end{aligned}$$

*The formula is satisfiable when* cfg@0 = True*.*

*2)* P2(a@0, a@1, x@0, x@1, x@2) = a@0 = 1 ∧ a@1 = 1 ∧ x@0 = 0 ∧ x@1 = 1 ∧ x@2 = 0*. For this case, the formula to be checked is:*

$$\begin{aligned} x@0 &= 0 \land \\ x@1 &= ite(cfg@0, x@0 + a@0, x@0 - a@0) \land \\ x@2 &= ite(cfg@1, x@1 + a@1, x@1 - a@1) \land \\ cfg@1 &= cfg@0 \land cfg@2 = cfg@1 \land \\ a@0 &= 1 \land a@1 = 1 \land x@0 = 0 \land x@1 = 1 \land x@2 = 0 \end{aligned}$$

*This formula is unsatisfiable, and thus there is no value of* cfg *that satisfies the desired property.*
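Because the input-output relations in Example 1 fix every input and output value, both cases can be checked by simply trying the two values of cfg and simulating the system. The following sketch does exactly that; it replaces the SMT query by enumeration, and the helper names `simulate` and `find_configuration` are assumptions introduced for illustration.

```python
def simulate(cfg, inputs):
    """Run the simple ALU from x = 0 with a constant configuration bit.

    Returns the trace x@0, x@1, ..., x@k for k = len(inputs).
    """
    x = 0
    trace = [x]
    for a in inputs:
        x = x + a if cfg else x - a   # x' = ite(cfg, x + a, x - a)
        trace.append(x)
    return trace


def find_configuration(inputs, expected_outputs):
    """Return a value of cfg that reproduces the expected trace, else None."""
    for cfg in (True, False):
        if simulate(cfg, inputs) == expected_outputs:
            return cfg
    return None


# P1: a@0 = a@1 = 1 and x@0, x@1, x@2 = 0, 1, 2
result_p1 = find_configuration([1, 1], [0, 1, 2])
# P2: a@0 = a@1 = 1 and x@0, x@1, x@2 = 0, 1, 0
result_p2 = find_configuration([1, 1], [0, 1, 0])
```

Holding `cfg` fixed for the whole simulation corresponds to the configuration constancy constraint. Enumeration only works here because there is a single Boolean configuration variable; for realistic designs the search is delegated to a solver.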

The framework for the basic scheme just outlined is shown in Figure 2. The input to the framework is a configuration problem. The framework constructs formula (1) and calls a solver to determine whether it is satisfiable. The output is either "not configurable" or the configuration C.

There are two main sources of complexity that limit the scalability of the approach. The first is the complexity of the

#### Algorithm 1 Modular configuration finding.

**Procedure** SOLVEMODULAR

**Input:** (CP<sub>1</sub>, CP<sub>2</sub>), a decomposition of CP.

**Output:** a pair (r, C) where, if r = sat, then C is a configuration of S.

1. ϕ<sub>1</sub> := MAKECP(CP<sub>1</sub>)
2. (r, I<sub>1</sub>) := SOLVE(ϕ<sub>1</sub>)
3. **if** r = sat **then**
4. ϕ<sub>2</sub> := MAKECP(CP<sub>2</sub>) ∧ GETABDUCT(ϕ<sub>1</sub>, I<sub>1</sub>)
5. (r, I) := SOLVE(ϕ<sub>2</sub>)
6. **end if**
7. **return** (r, I<sup>V<sub>conf</sub></sup>)

design itself, and the second is the bound k required by P. To address design complexity, we propose designing for modular configuration, discussed in more detail in Section III-B below. Designing systems that can be configured using only small values of k is an interesting research challenge that we plan to investigate in future work.

Another way to improve scalability is by using design knowledge to strengthen the formula ϕ. For example, if a configuration variable must be within a specific range, then this can be added as a constraint. Any constraint expressible in the language supported by the backend SMT solver can be supported.

#### *B. Modular Configuration*

A natural remedy for design complexity is modular decomposition. Here, we explain a systematic approach for modular configuration, including conditions under which a full configuration can be recovered.

Given CP = ⟨S, k, V<sub>in</sub>, V<sub>out</sub>, V<sub>conf</sub>, P⟩ with S = ⟨V, I, T⟩, we say (CP<sub>1</sub>, CP<sub>2</sub>) is a *decomposition* of CP (where CP<sub>i</sub> := ⟨S<sub>i</sub>, k, V<sup>i</sup><sub>in</sub>, V<sup>i</sup><sub>out</sub>, V<sup>i</sup><sub>conf</sub>, P<sub>i</sub>⟩ and S<sub>i</sub> := ⟨V<sub>i</sub>, I<sub>i</sub>, T<sub>i</sub>⟩ for i = 1, 2) if: (i) T<sub>1</sub>(V<sub>1</sub>, V′<sub>1</sub>) ∧ T<sub>2</sub>(V<sub>2</sub>, V′<sub>2</sub>) ⟹ T(V, V′); (ii) I<sub>1</sub>(V<sub>1</sub>) ∧ I<sub>2</sub>(V<sub>2</sub>) ⟹ I(V); (iii) P<sub>1</sub> ∧ P<sub>2</sub> ⟹ P; and (iv) V<sub>conf</sub> ⊆ V<sup>1</sup><sub>conf</sub> ∪ V<sup>2</sup><sub>conf</sub>.

We now describe a procedure SOLVEMODULAR, presented in Algorithm 1, which, given a decomposition (CP<sub>1</sub>, CP<sub>2</sub>) of a configuration problem CP, attempts to solve CP by solving CP<sub>1</sub> and CP<sub>2</sub>. The call to MAKECP on line 1 constructs the configuration formula for CP<sub>1</sub>. The call to SOLVE on line 2 invokes a solver to check the satisfiability of the configuration formula. If the formula is satisfiable, SOLVE returns a pair (sat, I) where I is a satisfying interpretation found by the solver. If the formula is unsatisfiable, SOLVE returns a pair (unsat, I) where I is an arbitrary interpretation. Line 4 creates the configuration formula for CP<sub>2</sub>. The formula is additionally constrained to ensure that the solution for CP<sub>2</sub> still satisfies ϕ<sub>1</sub>. The call to GETABDUCT returns a formula ψ such that ψ |=<sub>T</sub> ϕ<sub>1</sub>. The goal is to use the information in I<sub>1</sub> to generate a simple formula for ψ. The approach we take is to find a set of sub-terms in ϕ<sub>1</sub> such that, if we constrain them to be equal to their values in I<sub>1</sub>, this ensures that ϕ<sub>1</sub> is satisfied. In the worst case, we could constrain ϕ<sub>1</sub> itself to be equal to ⊤, which would effectively require solving all of ϕ<sub>1</sub> again at the same time as solving ϕ<sub>2</sub>. However, in practice, we can do much better. For example, it is often sufficient to let

Fig. 3: Modular decomposition of system S into systems S<sub>1</sub> and S<sub>2</sub>. V<sup>1</sup><sub>out</sub> and V<sup>1</sup><sub>conf</sub> are the output and the configuration variables of S<sub>1</sub>. V<sup>2</sup><sub>in</sub> and V<sup>2</sup><sub>conf</sub> are the input and the configuration variables of S<sub>2</sub>. V<sub>conf</sub> ⊆ V<sup>1</sup><sub>conf</sub> ∪ V<sup>2</sup><sub>conf</sub>.

ψ be the formula that assigns the free variables in ϕ<sub>1</sub> to their model values from I<sub>1</sub>.<sup>1</sup> If the second call to SOLVE succeeds, the result is a correct configuration for CP.

#### Theorem III.1. *(Soundness)*

*If* (CP<sub>1</sub>, CP<sub>2</sub>) *is a decomposition of a configuration problem* CP*, and* SOLVEMODULAR(CP<sub>1</sub>, CP<sub>2</sub>) *returns a pair* (sat, C)*, then* C *is a correct configuration of* CP*.*

*Proof.* Let SOLVEMODULAR return (sat, I<sup>V<sub>conf</sub></sup>). We prove that I<sup>V<sub>conf</sub></sup> is a correct configuration of CP. First, we notice that SOLVEMODULAR returns r = sat iff both calls SOLVE(ϕ<sub>1</sub>) and SOLVE(ϕ<sub>2</sub>) return r = sat. Let (sat, I<sub>1</sub>) and (sat, I) be the results of SOLVE(ϕ<sub>1</sub>) and SOLVE(ϕ<sub>2</sub>), respectively. Let ψ = GETABDUCT(ϕ<sub>1</sub>, I<sub>1</sub>). From line 5, I |= ϕ<sub>2</sub>. Thus, I |= MAKECP(CP<sub>2</sub>) and I |= ψ. Since ψ |=<sub>T</sub> ϕ<sub>1</sub>, we also have I |= ϕ<sub>1</sub>. Consequently, I satisfies: I<sub>1</sub>, T<sub>1</sub>(V<sub>1</sub>@i, V<sub>1</sub>@(i + 1)) for i ∈ [0, k − 1], conf(V<sup>1</sup><sub>conf</sub>, k), and P<sub>1</sub>. Furthermore, I satisfies: I<sub>2</sub>, T<sub>2</sub>(V<sub>2</sub>@i, V<sub>2</sub>@(i + 1)) for i ∈ [0, k − 1], conf(V<sup>2</sup><sub>conf</sub>, k), and P<sub>2</sub>. By the definition of decomposition, then, I satisfies I(V), T(V@i, V@(i + 1)) for i ∈ [0, k − 1], and P. Finally, from I |= conf(V<sup>1</sup><sub>conf</sub>, k), I |= conf(V<sup>2</sup><sub>conf</sub>, k), and condition (iv) of the definition of decomposition (V<sub>conf</sub> ⊆ V<sup>1</sup><sub>conf</sub> ∪ V<sup>2</sup><sub>conf</sub>), it follows that I |= conf(V<sub>conf</sub>, k). Thus, I satisfies the configuration formula of CP. Therefore, C := I<sup>V<sub>conf</sub></sup> is a correct configuration of CP.

If SOLVEMODULAR returns r = unsat, this does not (in general) imply that CP is unconfigurable. Rather, it may be that the particular decomposition fails, or even that the particular solution found for CP<sub>1</sub> is at fault (and another solution would have succeeded).
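Both the two-stage structure of SOLVEMODULAR and its incompleteness can be seen in a small enumeration-based sketch. Everything in it is an assumption made for illustration: the formulas are plain Python predicates over a single time step rather than unrolled transition systems, `solve` enumerates finite domains instead of calling an SMT solver, and the abduct is the simple choice of pinning every variable of ϕ<sub>1</sub> to its model value.

```python
from itertools import product


def solve(variables, domains, formula, fixed=None):
    """Enumerate assignments; variables in `fixed` keep their given values."""
    fixed = dict(fixed or {})
    free = [v for v in variables if v not in fixed]
    for values in product(*(domains[v] for v in free)):
        model = dict(fixed)
        model.update(zip(free, values))
        if formula(model):
            return True, model
    return False, None


def solve_modular(cp1, cp2, domains, conf_vars):
    """Solve cp1 first, then cp2 under the abduct derived from cp1's model."""
    vars1, phi1 = cp1
    vars2, phi2 = cp2
    sat, model1 = solve(vars1, domains, phi1)
    if not sat:
        return False, None
    # Abduct: pin every variable of phi1 to its value in model1.
    sat, model = solve(vars2, domains, phi2, fixed=model1)
    if not sat:
        return False, None
    return True, {v: model[v] for v in conf_vars}


# Hypothetical two-stage design: mid = 2 + c1 (stage 1), mid * c2 = 12 (stage 2).
stage1 = (["c1", "mid"], lambda m: m["mid"] == 2 + m["c1"])
stage2 = (["mid", "c2"], lambda m: m["mid"] * m["c2"] == 12)

wide = {"c1": range(4), "mid": range(16), "c2": range(1, 7)}
narrow = {"c1": range(4), "mid": range(16), "c2": range(1, 4)}

modular_wide = solve_modular(stage1, stage2, wide, ["c1", "c2"])
modular_narrow = solve_modular(stage1, stage2, narrow, ["c1", "c2"])
monolithic_narrow = solve(
    ["c1", "mid", "c2"], narrow, lambda m: stage1[1](m) and stage2[1](m)
)
```

With the narrow domain, the first stage commits to `c1 = 0` and the second stage then has no solution, although solving both stages together finds `c1 = 2, c2 = 3`. This is the situation described above, where the particular solution found for the first sub-problem is at fault.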

However, in practice, we have found that the algorithm works well when the decomposition separates a module into two largely independent parts. An example is shown in Figure 3. Here, the two submodules share only a subset of the configuration variables as well as an interface where outputs of the first module flow into inputs of the second module.

<sup>1</sup>See the appendix of an extended version of this paper for details on when and why this works [4]. Investigating other possible implementations for GETABDUCT is an interesting direction for future work.

Fig. 4: Optimization-assisted configuration framework. The input is a configuration problem with optional optimization and verification objectives. The framework can return: (i) a non-optimal but correct configuration, or (ii) an optimal and correct configuration, or (iii) unsat. ϕ′ is a conjunction of the configuration formula ϕ and the optional verification properties.

#### IV. OPTIMIZATION-ASSISTED CONFIGURATION

A solver can return an unnatural or non-intuitive configuration, complicating the ability of users to understand or maintain the configuration.

We observe that users tend to prefer the simplest configurations, where the notion of simplest corresponds to minimizing some metric when finding solutions. To this end, we show how to extend our framework with optimization goals.

Figure 4 depicts our configuration framework extended with support for multi-objective optimization. There are various ways to combine optimization with configuration solving; we depict one approach using iteration. One instance of this approach works as follows: first a solution is found and the value of the objective term is calculated; then the search space is systematically explored by iteratively constraining the value to be better than the current best value; when no better value can be found, the optimal value has been discovered. There are many different kinds of optimizations that fit this general framework. We present several useful examples in the context of the case study in Section V.
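The iterative scheme described above can be sketched as a loop around a solver oracle. The sketch is illustrative only: `toy_solve` is a hypothetical stand-in that enumerates a small space (pairs of `start` and `stride` with `start + 3 * stride == 12`), the metric `cost` is an assumed example of "simplest", and a real instance would instead add a constraint on the objective term to the configuration formula and re-invoke the SMT solver.

```python
def cost(model):
    """Assumed metric to minimise: the sum of the two configuration values."""
    start, stride = model
    return start + stride


def toy_solve(bound=None):
    """Return some model with cost below `bound` (any model if bound is None)."""
    for stride in range(13):
        for start in range(13):
            model = (start, stride)
            if start + 3 * stride == 12 and (bound is None or cost(model) < bound):
                return model
    return None


def optimize(solve, objective, max_iterations=100):
    """Tighten the bound on the objective until no better model exists.

    Returns (best model, list of objective values seen). If the iteration
    budget runs out first, the result is correct but possibly not optimal.
    """
    best = solve(None)
    if best is None:
        return None, []
    history = [objective(best)]
    for _ in range(max_iterations):
        better = solve(objective(best))   # require a strictly better value
        if better is None:
            break                         # no improvement: best is optimal
        best = better
        history.append(objective(best))
    return best, history


best_model, seen_costs = optimize(toy_solve, cost)
```

Stopping early still yields a usable configuration, which corresponds to the "non-optimal but correct" outcome listed in the caption of Figure 4.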

Further extensions. Figure 4 also includes an extension to support combining configuration-finding with verification. In this scheme, any invariants that the system should obey are conjoined to the configuration formula. This ensures that any configuration found satisfies the invariant up to bound k. Checking that an invariant holds for all reachable states requires a separate run of an unbounded model checker.

Finding the configuration itself using unbounded model checking is an interesting direction for future work. A significant challenge is that this requires writing the input-output property as a single state formula, which may be much harder than writing it as a bounded set of input-output pairs (in much the same way that loop invariants are difficult to come up with in software). If the input-output property can be written as a state formula P, it may be possible to utilize invariant synthesis techniques by seeking to synthesize an invariant of the form: ⋀<sub>i</sub> (V<sup>i</sup><sub>conf</sub> = C<sup>i</sup>) ⟹ P, where the left-hand side of the implication contains all configuration variables V<sup>i</sup><sub>conf</sub> ∈ V<sub>conf</sub>, and each C<sup>i</sup> is a constant value to be synthesized.

#### V. CASE STUDY

We present a case study with a coarse-grained reconfigurable architecture (CGRA) design developed in the Agile Hardware Center at Stanford University [5]. Reconfigurable architectures are appealing because they offer the high performance of hardware with software-like flexibility. CGRAs in particular use sophisticated reconfigurable elements with the aim of narrowing the performance gap with custom ASICs [6].

However, configuring a CGRA is challenging, typically requiring manual effort by an experienced engineer who fully understands the application and the design. To the best of our knowledge, ours is the first framework that finds correct CGRA configurations fully automatically.

In this paper, we focus on configuring a *memory tile* of the CGRA for image processing applications. In these applications data is streamed into the memory tile and must be reordered in various ways before being streamed out. Only the timing and order of the data are changed; the data itself remains the same. Below, we first describe the memory tile design, then present some specific applications, and then explain how we automate configuration of the design for these applications.

#### *A. CGRA Memory Tile Design*

The memory tile is a non-trivial design (34998 FF and 164696 gates). Figure 5 shows its architecture. It contains three types of units: *memories*, *addressors*, and *accessors*. Addressors and accessors are reconfigurable units. The accessors control *when* to write or read. The addressors control *where* to write or read. There are three memory modules: an *aggregator* module (AGG), a *static random-access memory* module (SRAM), and a *transpose buffer* module (TB). Each module has an *input accessor* and an *input addressor* associated with it for writes, and an *output accessor* and an *output addressor* for reads. The modules are chained: outputs of AGG are inputs

Fig. 5: Memory tile architecture. All accessors and addressors are included in the *control* box. Red arrows represent data flow. Blue and purple arrows represent addressor and accessor control signals, respectively. Green boxes are local to a single module. Orange boxes are shared between modules. V<sub>conf</sub> consists of all accessor and addressor configuration variables.

```
Procedure AFFINESEQUENCE
Input: dim: a value indicating the number of nested loops,
    ranges[dim]: an array of loop bounds, one for each loop,
    strides[dim]: an array of strides, one for each loop,
    offset: the offset for the address computation
Output: vals[Π_i ranges[i]]: a set of output addresses
var c[dim];                          ▷ Index variables for each loop
var i := 0;
for c[dim − 1] in [0, ranges[dim − 1]) do
    ...
    for c[0] in [0, ranges[0]) do
        ▷ Sum over all loops of index times stride, plus the offset
        vals[i] := Σ_{j=0}^{dim−1} c[j] ∗ strides[j] + offset;
        i := i + 1;
    end for
end for
```
Fig. 6: Affine sequence generator using nested loops.

to SRAM, and outputs of SRAM are inputs to TB. Accessors are *shared* between each pair of connected memory modules. Shared accessors act as *schedule generators* for each memory connection. They specify when the data should be transferred and set any required delays between when the data is produced and consumed. Addressors are unique for each module.

The addressors and accessors in the memory tile make use of affine sequence generators to generate sequences of values for reading and writing. Figure 6 shows pseudocode for an affine sequence generator. It takes as input a number dim of loops, an array ranges with bounds for each loop, an array strides with strides for each loop, and offset, which is a base value. It then computes a sequence of outputs, vals, by running dim nested loops and, in the innermost loop, computing the sum of the offset and the products of each stride with its loop index. Each of the inputs to the procedure corresponds to a configuration register in the hardware.
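As a software illustration of the generator described above, the following Python sketch computes the same sequence. It is an executable paraphrase of the pseudocode, not the hardware design; the function name `affine_sequence` and the example register values are assumptions made for illustration.

```python
from itertools import product


def affine_sequence(ranges, strides, offset):
    """Return offset + sum_j c[j] * strides[j] for every index vector c,
    with c[0] as the innermost (fastest-changing) loop index."""
    assert len(ranges) == len(strides)
    vals = []
    # itertools.product varies its last argument fastest, so the loop bounds
    # are passed in reverse order to make c[0] the innermost loop.
    for reversed_c in product(*(range(r) for r in reversed(ranges))):
        c = reversed_c[::-1]
        vals.append(sum(cj * sj for cj, sj in zip(c, strides)) + offset)
    return vals


# Assumed example: two nested loops, inner loop of 4 steps with stride 1,
# outer loop of 2 steps with stride 8, base value 3.
example = affine_sequence(ranges=[4, 2], strides=[1, 8], offset=3)
print(example)  # [3, 4, 5, 6, 11, 12, 13, 14]
```

The number of generated values is the product of the loop bounds, here 4 × 2 = 8.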

While each addressor and accessor contains an affine sequence generator, they differ in how they interpret vals. For an addressor, vals contains raw addresses sent to a memory (for either reading or writing). For an accessor, vals contains clock cycle counts that are compared to a running cycle counter to determine when to read or write. Note that an (accessor, addressor) pair should have the same values for their dim and ranges variables to ensure that they produce the same number of values. AGG has 4 accessors (including 2 shared with SRAM) and 4 addressors (1 for each memory port). TB has 4 accessors (including 2 shared with SRAM) and 4 addressors (1 for each memory port). SRAM has 2 addressors, and shares 2 accessors with AGG and 2 accessors with TB.

The memory tile processes 16-bit words. However, it uses a 512x64-bit SRAM which stores four 16-bit words at each address. The rationale for this design is to emulate a multiported SRAM while minimizing the energy consumption per memory access [7]. To match the data width at the SRAM interface, AGG and TB implement width converters. AGG implements a *serial-in to parallel-out* (SIPO) converter—serial data is loaded, one 16-bit word at a time, and these are packed into 64-bit outputs. TB implements a *parallel-in to serial-out* (PISO) converter—parallel data is loaded into the PISO as a 64-bit word and is shifted out of the PISO serially, one 16-bit word at a time. The memory tile uses a 2-input and 2-output port architecture to support more throughput. Thus, AGG and TB contain two SIPOs and two PISOs, respectively.
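The width conversion can be sketched in Python as follows. This is only a behavioral illustration: the function names are hypothetical, and placing the first 16-bit word in the least significant bits of the 64-bit word is an assumption, since the text does not specify the packing order.

```python
WORD_BITS = 16
WORDS_PER_ENTRY = 4  # four 16-bit words per 64-bit SRAM entry
WORD_MASK = (1 << WORD_BITS) - 1


def sipo_pack(words):
    """Serial-in to parallel-out: pack four 16-bit words into one 64-bit value.
    Assumed order: the first word occupies the least significant bits."""
    if len(words) != WORDS_PER_ENTRY:
        raise ValueError("expected exactly four 16-bit words")
    wide = 0
    for k, w in enumerate(words):
        wide |= (w & WORD_MASK) << (WORD_BITS * k)
    return wide


def piso_unpack(wide):
    """Parallel-in to serial-out: split a 64-bit value back into four words,
    emitted in the same order in which sipo_pack consumed them."""
    return [(wide >> (WORD_BITS * k)) & WORD_MASK
            for k in range(WORDS_PER_ENTRY)]


packed = sipo_pack([1, 2, 3, 4])
print(hex(packed))          # 0x4000300020001
print(piso_unpack(packed))  # [1, 2, 3, 4]
```

The round trip shows that only the grouping of the data changes, not the data itself.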

#### *B. Stencil Applications*

We consider a common class of image-processing techniques called *stencils*. Stencil computations usually consist of a multi-stage pipeline, where each stage is a dense linear algebra computation in a local region. So-called *push memories* are inserted between computation units, whose job is to orchestrate the order and the timing of the data explicitly [8]. We explore configuring memory tiles as push memories for four stencil applications:


#### *C. Automating the Memory Tile Configuration*

We decompose the memory tile into three sub-modules (for scalability), following the approach shown in Figure 3. The first sub-module includes AGG, its input/output accessor/addressor modules, and the MUX (1372 FF, 19676 gates). The second sub-module includes SRAM, both AGG read accessors, and both TB write accessors (33712 FF, 150750 gates). The third sub-module includes TB and its input/output accessor/addressor modules (1126 FF, 18538 gates). Shared accessors contain the shared configuration variables, whose values are propagated to the next module during modular configuration.

In order to configure each module in the memory tile, we look at the transition system defined by its memory and its accessors and addressors. We then use the "programming by example" approach described above. We specify the input-output property P as a sequence of distinct input values (e.g., 1, 2, 3, ...), paired with the corresponding application-specific desired output sequence based on those values. We then solve for the configuration variables as described in Section III-A above.

As mentioned in Section IV, it is important to generate configurations that can easily be read and understood. Working together with the designers, we devised a set of optimization objectives that greatly improve the readability of memory tile configurations. We explain these next. We apply the framework of Figure 4 to configure and optimize each module separately.

*Objective* 1: we first minimize the dim variables in the module, since this corresponds to using fewer nested loops and fewer loop counters, resulting in simpler solutions in general. We prioritize minimizing dim variables controlling writes over those controlling reads, as lower write complexity leads to lower read complexity anyway. We formalize this as the following multi-objective optimization problem:

$$\begin{aligned} \mathcal{MOP}_1 &:= \{\mathcal{OP}_1, \mathcal{OP}_w^1, \dots, \mathcal{OP}_w^{d_w}, \mathcal{OP}_r^1, \dots, \mathcal{OP}_r^{d_r}\} : \\ \mathcal{OP}_1 &:= \langle \Sigma_i \dim_i, A_{BV}, \preccurlyeq_{BV}, \phi, \min \rangle \text{ for } i \in [1, d], \\ \mathcal{OP}_w^i &:= \langle \dim_w^i, A_{BV}, \preccurlyeq_{BV}, \phi, \min \rangle \text{ for } i \in [1, d_w], \\ \mathcal{OP}_r^i &:= \langle \dim_r^i, A_{BV}, \preccurlyeq_{BV}, \phi, \min \rangle \text{ for } i \in [1, d_r] \end{aligned}$$

Here, A<sub>BV</sub> is the domain of bit-vectors (i.e., unsigned machine integers), ≼<sub>BV</sub> is the usual total order on bit-vector values, d is the number of affine sequence generators in the module, and dim<sub>i</sub> for i ∈ [1, d] are all of the dim variables in the module. These are further partitioned into write dimensionality variables dim<sup>i</sup><sub>w</sub>, i ∈ [1, d<sub>w</sub>], and read dimensionality variables dim<sup>i</sup><sub>r</sub>, i ∈ [1, d<sub>r</sub>], with d<sub>w</sub> + d<sub>r</sub> = d. ϕ is the configuration formula.

*Objective* 2: we minimize the products of the range configuration variables in each loop-nest structure. The objective term corresponds to the aggregate number of reads or writes that occur to a particular memory. By minimizing this number, we eliminate unnecessary reads and writes to the memory. Formally, the optimization problem is:

$$\mathcal{OP}_2 := \langle \Sigma_{i=0}^{d-1} \Pi_{j=0}^{\dim_i - 1} \mathit{range}_i[j], A_{BV}, \preccurlyeq_{BV}, \phi, \min \rangle$$

*Objective* 3: we minimize stride variables to avoid generating configurations using unnecessarily large addresses.

Many different sets of values for strides could produce the same vals stream in the end, so by choosing the smallest values, we hope to generate the simplest solution. The optimization problem simply minimizes the sum of all stride variables in the module:

$$\mathcal{OP}_3 := \langle \Sigma_i \, \mathit{strides}_i, A_{BV}, \preccurlyeq_{BV}, \phi, \min \rangle.$$

*Objective* 4: we also minimize offset configuration variables in addressor modules. For addressor modules, minimizing the offset variable prevents unnecessary offsets, improving the readability of the generated configuration. Note that values of offset variables in the accessors are fixed by the application. The corresponding problem is as follows, minimizing the sum of all addressor offset variables in the module:

$$\mathcal{OP}_4 := \langle \Sigma_i \, \mathit{offset}_i, A_{BV}, \preccurlyeq_{BV}, \phi, \min \rangle.$$

*Combined objective*: the combined optimization query includes all four objectives and captures the full set of optimization objectives for each module:

$$\mathcal{MOP}_{\mathcal{H}} := \{ \mathcal{MOP}_1, \mathcal{OP}_2, \mathcal{OP}_3, \mathcal{OP}_4 \}.$$

We solve and prioritize MOP<sub>1</sub> by iteratively increasing the bound on the sum Σ<sub>i</sub> dim<sub>i</sub> and, for each bound, trying all possible assignments to the variables, in the order specified by MOP<sub>1</sub>. Note that this approach does not directly fit the scheme described in Figure 4, since it does not require finding a first solution that is iteratively improved. Instead, it iteratively widens the search space until the first solution is found.
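The widening search can be illustrated with a small Python sketch. It is a heavily simplified assumption-laden toy: it configures a single affine sequence generator by brute force from an example output sequence, ignores the write-before-read prioritization, and uses hypothetical names and search bounds (`find_min_dim_config`, `max_range`, and so on) rather than anything from the actual implementation.

```python
from itertools import product
from math import prod


def affine_sequence(ranges, strides, offset):
    """Values produced by nested loops, with c[0] as the innermost index."""
    vals = []
    for reversed_c in product(*(range(r) for r in reversed(ranges))):
        c = reversed_c[::-1]
        vals.append(sum(cj * sj for cj, sj in zip(c, strides)) + offset)
    return vals


def find_min_dim_config(target, max_dim=3, max_range=8, max_stride=8,
                        max_offset=8):
    """Widen the bound on dim step by step; the first configuration found
    therefore uses the smallest possible number of nested loops."""
    for dim in range(1, max_dim + 1):
        for ranges in product(range(1, max_range + 1), repeat=dim):
            if prod(ranges) != len(target):
                continue  # this loop nest produces the wrong number of values
            for strides in product(range(max_stride + 1), repeat=dim):
                for offset in range(max_offset + 1):
                    if affine_sequence(ranges, strides, offset) == target:
                        return dim, ranges, strides, offset
    return None  # no configuration within the assumed search bounds


# Desired output sequence given as an example (assumed values).
config = find_min_dim_config([3, 4, 5, 6, 11, 12, 13, 14])
print(config)  # (2, (4, 2), (1, 8), 3)
```

No single-loop configuration reproduces the example sequence, so the search only succeeds once the bound is widened to two loops.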

For the other objectives, we use a branch-and-bound algorithm. First, a solution is found, and the value of the term is calculated; then, the solution space is explored systematically, by iteratively constraining the value of the objective term to be better than the current best value. Each optimal solution is propagated to the next optimization objective as a constraint.
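The propagation of each optimum to the next objective can be sketched in Python as follows. This is a toy under stated assumptions: `solve` is a hypothetical brute-force stand-in for an SMT query, and the variables, constraint, and objectives are invented for illustration.

```python
from itertools import product


def solve(domains, constraints):
    """Hypothetical stand-in for a solver query: first assignment, in
    enumeration order, that satisfies every constraint, or None."""
    names = list(domains)
    for values in product(*(domains[n] for n in names)):
        cfg = dict(zip(names, values))
        if all(check(cfg) for check in constraints):
            return cfg
    return None


def minimize_in_order(domains, base_constraint, objectives):
    """Minimize each objective in turn, fixing its optimum as a constraint
    before moving on to the next objective."""
    constraints = [base_constraint]
    best = solve(domains, constraints)
    if best is None:
        return None
    for objective in objectives:
        while True:
            bound = objective(best)
            tighter = constraints + [
                lambda cfg, f=objective, b=bound: f(cfg) < b]
            better = solve(domains, tighter)
            if better is None:
                break  # current value of this objective is optimal
            best = better
        optimum = objective(best)
        # Propagate the optimum so later objectives cannot undo it.
        constraints.append(lambda cfg, f=objective, v=optimum: f(cfg) == v)
    return best


# Toy problem (assumed): descending domains make the first solution poor.
domains = {name: [3, 2, 1, 0] for name in ("x", "y", "z")}
base = lambda cfg: cfg["x"] + cfg["y"] + cfg["z"] >= 4
result = minimize_in_order(domains, base,
                           [lambda cfg: cfg["x"], lambda cfg: cfg["y"]])
print(result)  # {'x': 0, 'y': 1, 'z': 3}
```

Here x is minimized first; y is then minimized only among the solutions that keep x at its optimal value.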

#### VI. EVALUATION

Implementation. We have implemented our framework using Pono [11], an open-source SMT-based model checker. Pono is built on Smt-Switch [12], a generic C++ API for interacting with SMT solvers. Pono provides infrastructure for reading in, unrolling, and otherwise manipulating transition systems. We use Boolector [13] as the underlying SMT solver. We convert the memory tile design in our case study from a SystemVerilog representation to its equivalent representation in the Btor2 format [13], which is accepted by Pono. We use Yosys [14], a Verilog synthesis suite, to do the translation. The experimental code is available at https://github.com/StanfordAHA/Confguration/.

Experimental Results. We evaluate our configuration-finding framework using the memory tile design and the four stencil applications described in Section V. For each application, we generated benchmarks for various input image sizes, from 16x16 to 60x60. For applications that require more than one memory tile (i.e., Cascade and Harris), we choose one representative configuration problem: conv for Cascade and lxx for Harris (more results appear in the appendix of an extended version of this paper [4]). The number of transitions required for each configuration problem is based on the number of clock cycles it takes to process an image of a given size for a given application.

For each benchmark, we first run the basic algorithm described in Section III, which finds the first satisfying configuration. We try both with and without the modular approach described in Section III-B. We then run our optimization-assisted configuration algorithm (using only the modular approach) as described in Section IV. We run our experiments on a 2x Intel Xeon E5-2620 v4 @ 2.10GHz 8-core 128GB computer. Timeout is set to 4000 seconds. Memory limit is 100 GB.

The results are shown in Figure 7. Each chart shows results for both the basic algorithm (First Configuration) and the optimization-assisted algorithm (Optimal Configuration). Within each of these categories, up to five different results are shown for each image size: *top* is the time required to configure the entire design, monolithically; *agg*, *tb*, and *sram* refer to the time required to configure each of the submodules independently; and *sram agg tb* is the time required to configure the SRAM module after first configuring AGG and TB (this is the most efficient order for these modules) and then propagating the shared configurations from those modules as described in Figure 3. Note that in the modular approach, AGG and TB are configured independently; thus, the configuration can be performed in parallel, and the total design configuration time is the sum of *sram agg tb* and the maximum of *agg* and *tb*. Timeouts are represented by full bars (up to the timeout limit), and memory outs are represented by omitting the bar completely. We also omit the bar for *sram agg tb* if either AGG or TB is not solved within the given time-memory budget. We make several observations about the results below.

Modular Approach. As the experiments show, the full memory tile is too large to solve within the given time-memory budget: it times out for all image sizes. However, by using the modular approach, we are able to configure the design for all applications for reasonably useful image sizes. For the Identity Stream, we can configure for all image sizes (with unroll depths up to 3601) relatively easily using the modular approach. Other applications are more challenging, but we are still able to scale up to images of size 40x40 (and unroll depth up to 1939 clock cycles).

We also observe that the AGG and TB modules take comparable time for the Identity Stream, but for other applications, configuration of the TB module is more challenging. This can be explained as follows. AGG and TB are both two-port designs, comparable in size and complexity. But for all applications, AGG can be configured by exploiting only a single port, while only the Identity Stream allows a single-port configuration of TB. Thus, we quickly find a simple configuration for TB with the Identity Stream, but no comparatively simple configuration exists for the other applications.

Optimal Configurations. The right-hand side of each chart shows the results of running our optimization-assisted configuration algorithm for each application. There are several interesting observations. First of all, for the AGG and TB modules, finding optimal configurations is generally more expensive. However, once these optimal configurations are found, it is often easier to find the corresponding SRAM configuration, suggesting that optimal configurations may help improve later stages of modular configuration. The total configuration time with optimization is generally comparable to or only slightly worse than the time required to configure without optimization. Given the value of optimal configurations in terms of simplicity and readability, these results suggest that modular configuration with optimization may be the best strategy in practice.

#### VII. RELATED WORK

The problem of system configuration has been studied in various formulations and domains, such as software tool configuration, hardware configuration, network configuration, distributed application configuration, and deployment strategies. In one research stream, the configuration problem is to select and arrange a set of components from a given set of assets in order to construct an overall system with a desired specification [15]–[18]. Other formulations take as input a configuration database, including configuration variables, and desired requirements to be met [19], [20]. The task is to find

(c) Cascade (conv) (d) Harris (lxx)

Fig. 7: Horizontal axis shows image sizes and number of clock cycles required for processing. Vertical axis shows time in seconds.

values for the configuration variables which instantiate the database so that it meets the requested requirement. The work whose problem definition is closest to ours is [21], which also uses transition systems. The authors define a configuration as an initial state of a transition system, which is very similar to our notion of configuration variables.

Constraint solving has been explored in various ways for automating system configuration. Efforts have been made to design declarative, constraint-based, object-oriented languages and policy-based tools to configure systems as well as to validate configurations [19], [22]–[24]. Early approaches were based on constraint satisfaction and constraint logic programming [18], [25], [26]. More recent approaches utilize SAT and SMT solvers [17], [19], [27], and counterexample-guided inductive synthesis and relational model finding [21], [28] for dynamic configuration. However, the way these approaches reduce configuration problems to constraint satisfaction problems is significantly different from our approach using input/output examples and unrolling.

More significantly, our work differs in its use of modularity and optimization to improve scalability and understandability. Some automated configuration efforts do employ optimization (e.g., [29]), but with a different goal, namely to configure a system in a way that maximizes its performance.

#### VIII. CONCLUSION

We proposed a new approach for automatically configuring systems representable as transition systems. Key contributions of our approach include its ability to leverage modularity and its use of optimization. Optimal configurations are more human-understandable, and both modularity and optimization can improve scalability. We demonstrated these claims with a case study using a CGRA memory tile.

Future directions for this work include incorporating unbounded model checking, applying the framework to a wider variety of designs, exploring modularity for more sophisticated theories, and finding provably correct configurations for applications with repeating input/output patterns.

#### ACKNOWLEDGMENTS

This work was funded in part by the Stanford Agile Hardware Center and by the Defense Advanced Research Projects Agency under grant number FA8650-18-2-7854.

#### REFERENCES


# Towards an Automatic Proof of Lamport's Paxos

Aman Goel *University of Michigan, Ann Arbor* amangoel@umich.edu

*Abstract*—Lamport's celebrated Paxos consensus protocol is generally viewed as a complex hard-to-understand algorithm. Notwithstanding its complexity, in this paper, we take a step towards automatically proving the safety of Paxos by taking advantage of three structural features in its specification: *spatial regularity* in its unordered domains, *temporal regularity* in its totally-ordered domain, and its *hierarchical composition*. By carefully integrating these structural features in IC3PO, a novel model checking algorithm, we were able to infer an inductive invariant that identically matches the human-written one previously derived with significant manual effort using interactive theorem proving. While various attempts have been made to verify different versions of Paxos, to the best of our knowledge, this is the first demonstration of an automatically-inferred inductive invariant for Lamport's original Paxos specification. We note that these structural features are not specific to Paxos and that IC3PO can serve as an automatic general-purpose protocol verification tool.

*Index Terms*—Distributed protocols, incremental induction, inductive invariant, invariant inference, model checking, Paxos.

#### I. INTRODUCTION

In this paper, we focus on proving the *safety* of distributed protocols like Paxos [1], [2] which form the basis for implementing many efficient and highly fault-tolerant distributed services [3]–[5]. Developed by Lamport, the Paxos consensus protocol allows a set of processes to communicate with each other by exchanging messages and reach agreement on a single value. Verifying the correctness of such a concurrent system requires the derivation of a *quantified inductive invariant* that, together with the protocol specification, acts as an inductive proof of its safety under all possible system behaviors.

Several manual or semi-automatic verification techniques based on interactive theorem proving [6]–[9] have been proposed to derive a safety proof for Paxos. Chand et al. [10] formally verified the TLA+ [11] specification of Paxos by manually deriving a proof using the TLAPS proof assistant [7]. Padon et al. [12] used the Ivy [13] verifier, which requires a user to manually refine automatically-generated counterexamples-to-induction, to obtain an inductive invariant for a simplified version of Paxos in the decidable EPR fragment [14] of first-order logic. The approaches in [15]–[19] are examples of manually-derived *refinement proofs* [20]–[23] that show how a low-level implementation refines a high-level specification. All these methods, however, require a detailed understanding of the intricate inner workings of the protocol and entail significant manual effort to guide proof development.

Karem A. Sakallah *University of Michigan, Ann Arbor* karem@umich.edu

Fig. 1: Hierarchical strengthening of Paxos and its variants. Each level uses all strengthening assertions above that level as input, and outputs the required remaining assertions, altogether inferring the inductive invariant at each level.

In contrast, we propose an approach, implemented in the IC3PO protocol verifier, to *automatically infer the required inductive invariant* for an unbounded distributed protocol by adding three simple extensions to the finite-domain IC3/PDR [24], [25] incremental induction algorithm for model checking [26]. *Symmetry boosting*, introduced in [27], takes advantage of a protocol's *spatial* regularity to automatically infer quantified strengthening assertions that reflect the protocol's structural symmetries. This paper describes *range boosting* and *hierarchical strengthening* which take advantage, respectively, of a protocol's *temporal* regularity and hierarchical structure, and demonstrates how IC3PO was used to automatically obtain an inductive invariant for Paxos using the four-level hierarchy shown in Figure 1.

Our main contributions are:


level abstractions as *strengthening assertions* to derive the inductive invariant for the detailed lower-level protocol.

– Safety verification of *Lamport's Paxos algorithm*, both single- and multi-decree Paxos, through the derivation of a compact, human-readable inductive proof that is automatically inferred using IC3PO, resulting in a drastic reduction in verification effort compared to previous approaches [16], [28], [29].

The paper is structured as follows: §II presents preliminaries. §III and §IV describe range boosting and hierarchical strengthening. §V details the four-level hierarchy we used to prove Paxos and §VI is a record of the IC3PO run showing the actual assertions it inferred at each level of the hierarchy. §VII discusses some of the features and interesting details on this automatically-generated proof. Experimental comparisons with other approaches are provided in §VIII and the paper concludes with a brief survey of related work in §IX and a discussion of future directions in §X.

#### II. PRELIMINARIES

#### *A. Notation*

We will use Init, Next, and Safety to denote the quantified formulas that specify, respectively, a protocol's initial states, its transition relation, and the safety property that is required to hold on all reachable states. We use primes (e.g., φ′) to represent a formula after a single transition step. The notation V!A (resp. S!A, I!A, and P!A) means that assertion A was inferred by IC3PO for the *Voting* (resp. *SimplePaxos*, *ImplicitPaxos*, and *Paxos*) protocol.

As an example, consider a protocol P with two sorts, a symmetric sort aSort and a totally-ordered sort bSort, along with relations p(aSort, bSort) and q(bSort) defined on these sorts. Viewed as a parameterized system P(aSort, bSort), we can specify its finite instance P(3, 4) as:

$$\begin{aligned} \mathcal{P}(3,4): \qquad \mathsf{aSort}_3 &\stackrel{\triangle}{=} \{\mathsf{a}_1, \mathsf{a}_2, \mathsf{a}_3\} \\ \mathsf{bSort}_4 &\stackrel{\triangle}{=} [\mathsf{b}_{\mathsf{min}}, \mathsf{b}_1, \mathsf{b}_2, \mathsf{b}_{\mathsf{max}}] \end{aligned} \tag{1}$$

where aSort<sub>3</sub> represents the finite symmetric sort of this instance defined as a set of arbitrarily-named distinct constants, while the finite totally-ordered sort bSort<sub>4</sub> is composed of a list of ordered constants, i.e., b<sub>min</sub> < b<sub>1</sub> < b<sub>2</sub> < b<sub>max</sub>. This instance can be encoded using twelve p and four q BOOLEAN state variables. A *state* of this instance corresponds to a complete assignment to these 16 state variables, with a total state-space size of 2<sup>16</sup>. We will use Next ⋀ instead of Next to denote the transition relation of the finite instance.
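The variable count can be checked with a short Python sketch that enumerates the ground atoms of the finite instance. The string encoding of the atoms is an assumption made purely for illustration.

```python
from itertools import product

# Constants of the finite instance P(3, 4).
a_sort = ["a1", "a2", "a3"]            # symmetric sort, order is irrelevant
b_sort = ["bmin", "b1", "b2", "bmax"]  # totally-ordered sort, ascending

# One Boolean state variable per ground atom of each relation.
p_vars = [f"p({a},{b})" for a, b in product(a_sort, b_sort)]
q_vars = [f"q({b})" for b in b_sort]
state_vars = p_vars + q_vars

num_states = 2 ** len(state_vars)
print(len(p_vars), len(q_vars), num_states)  # 12 4 65536
```

The relation p contributes 3 × 4 = 12 variables and q contributes 4, giving 2<sup>16</sup> = 65536 states.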

#### *B. Clause Boosting and Quantifier Inference*

The basic framework for inferring the quantified assertions required to prove protocol safety is described in [27]. It extends the finite IC3/PDR incremental induction algorithm by *boosting* its clause learning during the 1-step backward reachability checks performed through Satisfiability Modulo Theories (SMT) [30] solving. Specifically, a clause φ is learned in (and refines) frame F<sub>i</sub> if the 1-step query ψ<sub>i</sub> := F<sub>i−1</sub> ∧ Next ⋀ ∧ [¬φ′] is unsatisfiable. This means that cube ¬φ in frame F<sub>i</sub> is unreachable from frame F<sub>i−1</sub>. Boosting refers to: a) "growing" φ to a set of clauses that also satisfy this *unreachability constraint* from frame F<sub>i−1</sub>, and b) refining the frame F<sub>i</sub> with the entire clause set instead of just φ. Such boosting accelerates the convergence of incremental induction but, more importantly, makes it possible, under some regularity assumptions, to represent this set of clauses by a *single logically-equivalent quantified clause* Φ, which is the key to generalizing the results of such finite analysis to unbounded domains.

#### *C. Symmetric Boosting and Quantifier Inference*

Protocols that are strictly specified in terms of symmetric sorts can be characterized as having *spatial* regularity. For example, the constants in a sort representing a finite set of k identical processes are essentially indistinguishable *replicas* that can be permuted arbitrarily without changing the protocol behavior. A learned clause φ parameterized by the constants of such a sort can be boosted by permuting its constants in all possible k! ways yielding a set of symmetrically-equivalent clauses, i.e., its symmetry *orbit* φ<sup>Sym<sub>k</sub></sup> under the full symmetric group Sym<sub>k</sub>. By construction, all clauses in φ's orbit automatically satisfy the unreachability constraint without the need to perform additional 1-step queries. Furthermore, the quantified clause Φ that encodes φ's orbit is algorithmically constructed by a syntactic analysis of φ's structure, and can involve complex universal and existential quantifier alternations over both state and non-state (auxiliary) variables. The reader is referred to [27], [31] for the complete details of the connection between symmetry and quantification and the procedure for quantifier inference.
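The orbit construction can be illustrated with a minimal Python sketch. It only enumerates permuted clauses and does not perform quantifier inference; the clause encoding (a set of predicate-name and argument-tuple pairs) and the function name `orbit` are assumptions made for illustration.

```python
from itertools import permutations


def orbit(clause, sort_constants):
    """Return all clauses obtained by permuting the constants of one
    symmetric sort. A clause is a frozenset of (predicate, args) literals."""
    result = set()
    for perm in permutations(sort_constants):
        mapping = dict(zip(sort_constants, perm))
        permuted = frozenset(
            # Constants of other sorts are left unchanged by mapping.get.
            (pred, tuple(mapping.get(arg, arg) for arg in args))
            for pred, args in clause
        )
        result.add(permuted)
    return result


a_sort = ["a1", "a2", "a3"]

# p(a1, b1) or q(b2): mentions a single symmetric constant.
clause_one = frozenset({("p", ("a1", "b1")), ("q", ("b2",))})
# p(a1, b1) or not p(a2, b1): mentions two distinct symmetric constants.
clause_two = frozenset({("p", ("a1", "b1")), ("not p", ("a2", "b1"))})

print(len(orbit(clause_one, a_sort)))  # 3
print(len(orbit(clause_two, a_sort)))  # 6
```

Although all 3! = 6 permutations are applied, several of them can map a clause to the same result, so the orbit of the first clause contains only three distinct clauses.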

#### *D. Finite Convergence*

When a boosted finite incremental induction run terminates, it either produces a finite counterexample demonstrating that the specified safety property fails, or produces a set of quantified assertions A<sub>1</sub>, ..., A<sub>n</sub> that yield the inductive invariant inv = Safety ∧ A<sub>1</sub> ∧ ... ∧ A<sub>n</sub> proving safety for the given finite size. At this point, an algorithmic *finite convergence* procedure is invoked to check if the current instance size has captured all possible protocol behaviors and, if not, to systematically increase the finite instance size until protocol behavior saturates and the cutoff size is reached [32]–[36].

#### III. RANGE BOOSTING

Clause boosting is not limited to clauses that are parameterized by the constants of symmetric sorts, and can be extended to clauses whose literals depend on the constants of totally-ordered sorts such as ballot, round, epoch, etc., that are used to model the temporal order of events in a distributed protocol. However, the boosting procedure for such clauses differs from symmetric boosting in two ways: a) the ordering relation between totally-ordered constants must be explicitly preserved, and b) adherence of a boosted clause to the unreachability constraint is not guaranteed and must be explicitly checked with a 1-step backward reachability query.

We extended IC3PO with a *range boosting* procedure that complements its symmetry boosting mechanism, allowing it to transparently handle protocols with both symmetric and totally-ordered sorts.

Let φ be a clause that is parameterized by totally-ordered constants and let φ<sup>Ordered</sup> denote those variants of φ that are obtained by ordering-compliant permutations of its constants. Clause φ is boosted by making 1-step backward reachability queries on φ<sup>Ordered</sup> to identify its *safe* subset φ<sup>Safe</sup>, i.e., those variants that satisfy the unreachability constraint.

For example, consider the following clause φ<sub>1</sub> defined on the finite instance P(3, 4) from (1):

$$
\varphi_1 = p(\mathbf{a_1}, \mathbf{b_1}) \lor q(\mathbf{b_2}) \tag{2}
$$

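Since φ<sub>1</sub> contains two ordered constants (b<sub>1</sub>, b<sub>2</sub>), it has six ordering-compliant variants: (b<sub>min</sub>, b<sub>1</sub>), (b<sub>min</sub>, b<sub>2</sub>), (b<sub>min</sub>, b<sub>max</sub>), (b<sub>1</sub>, b<sub>2</sub>), (b<sub>1</sub>, b<sub>max</sub>), and (b<sub>2</sub>, b<sub>max</sub>). However, only three of these variants end up satisfying the unreachability constraint, yielding the following safe subset of φ<sub>1</sub><sup>Ordered</sup>: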

$$
\begin{aligned}
\varphi_1^{Safe} = {} & [\, p(\mathbf{a_1}, \mathbf{b_1}) \lor q(\mathbf{b_2}) \,] \;\land \\
& [\, p(\mathbf{a_1}, \mathbf{b_1}) \lor q(\mathbf{b_{\max}}) \,] \;\land \\
& [\, p(\mathbf{a_1}, \mathbf{b_2}) \lor q(\mathbf{b_{\max}}) \,]
\end{aligned} \tag{3}
$$

The inferred quantified clause that encodes these three clauses is now constructed using two universally-quantified variables X<sub>1</sub>, X<sub>2</sub> ∈ bSort<sub>4</sub> that replace b<sub>1</sub> and b<sub>2</sub> in φ<sub>1</sub>, and is expressed as an implication whose antecedent specifies a constraint over the ordered "range" b<sub>min</sub> < X<sub>1</sub> < X<sub>2</sub> that must be satisfied by the quantified variables:

$$
\begin{aligned}
\Phi_1 = {} & \forall X_1, X_2 \in \mathsf{bSort_4} : \\
& (\mathsf{b_{\min}} < X_1) \land (X_1 < X_2) \to [\, p(\mathsf{a_1}, X_1) \lor q(X_2) \,]
\end{aligned} \tag{4}
$$

In general, a clause that is parameterized by k constants from a totally-ordered domain whose size is greater than k can be range-boosted and encoded by a universally-quantified predicate with k variables, expressed as an implication whose antecedent is a range constraint that evaluates to true for just those combinations of the k variables that correspond to safe variants of φ.

This procedure extends easily to the case of multiple totally-ordered domains as well, allowing range boosting to be performed independently for each such domain in *any* order since constants from different domains do not interfere with each other.
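The φ<sub>1</sub> example can be replayed in a few lines of Python. This is only an illustrative sketch: `is_safe` is a hypothetical stand-in for the 1-step backward reachability query (a real implementation would call an SMT solver), and it simply hard-codes the three safe variants reported in the text. The sketch enumerates the ordering-compliant variants over ballot<sub>4</sub> and checks that the range constraint of Φ<sub>1</sub> selects exactly the safe subset.

```python
from itertools import combinations

# Finite totally-ordered sort, listed in increasing order.
BALLOT = ["bmin", "b1", "b2", "bmax"]
RANK = {c: i for i, c in enumerate(BALLOT)}


def ordered_variants(k, domain=BALLOT):
    """All ordering-compliant assignments of k ordered constants (strictly increasing)."""
    return list(combinations(domain, k))


def is_safe(variant):
    """Hypothetical stand-in for the 1-step backward reachability query.

    The outcome is hard-coded to the three safe variants reported for phi_1;
    it is an assumption of this sketch, not a real solver call.
    """
    return variant in {("b1", "b2"), ("b1", "bmax"), ("b2", "bmax")}


def range_constraint(x1, x2):
    """Antecedent of Phi_1: (bmin < X1) and (X1 < X2)."""
    return RANK["bmin"] < RANK[x1] and RANK[x1] < RANK[x2]


variants = ordered_variants(2)
safe = [v for v in variants if is_safe(v)]
# All pairs over the domain for which the antecedent of Phi_1 is true.
covered = [(x1, x2) for x1 in BALLOT for x2 in BALLOT if range_constraint(x1, x2)]

print(len(variants), "ordering-compliant variants")
print("safe subset:", safe)
print("range constraint matches safe subset:", sorted(covered) == sorted(safe))
```

Separating variant enumeration from the safety oracle mirrors the two requirements stated above: the ordering is preserved by construction, while safety is checked per variant.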

#### IV. HIERARCHICAL STRENGTHENING

As advocated in [37], hierarchical structuring is an effective way to manage complexity during manual proof development. It can also be easily incorporated in the IC3PO style of invariant generation based on symmetry and range boosting.

Given a low-level specification L that implements a high-level specification H, i.e., L ≺ H, hierarchical strengthening starts by automatically deriving strengthening assertions H!A<sub>H</sub> that, together with the safety property H!Safety, prove the safety of H. It then maps and propagates H!A<sub>H</sub> to L, denoted as L!A<sub>H</sub>, and proceeds to prove the strengthened property L!Safety ∧ L!A<sub>H</sub> in L by deriving any additional assertions L!A<sub>L</sub> needed to establish the safety of L. The underlying assumption in this procedure is that proving H is much easier than proving L directly, and that any assertions derived to prove H are also applicable, with suitable mapping, to L. The final inductive invariant that proves L will, thus, have the form L!inv = (L!Safety ∧ L!A<sub>H</sub>) ∧ L!A<sub>L</sub>, which can be interpreted as reducing the complexity of L's proof by strengthening its safety property with assertions derived for H.

Such strengthening can be extended to a k-level hierarchy H ≺ M<sub>1</sub> ≺ · · · ≺ M<sub>k−2</sub> ≺ L, where M<sub>1</sub> to M<sub>k−2</sub> are suitably-defined intermediate levels between H and L. This, in turn, allows single-level automatic verification techniques based on incremental induction, like IC3PO, to scale to complex protocols like *Paxos*, by step-wise verifying higher-level abstractions first and using their auto-generated proofs to incrementally build the proof for the lower-level protocol.
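The overall control flow of hierarchical strengthening can be sketched as a simple loop over levels, from the most abstract to the most concrete. Everything in this sketch is hypothetical scaffolding rather than IC3PO's actual interface: the `Prover` signature, the textual `apply_mapping` helper, and the two stub provers that return fixed assertions are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical prover interface: given a safety property and the assertions
# inherited from the level above, return any extra assertions it had to learn.
Prover = Callable[[str, List[str]], List[str]]
Level = Tuple[str, Prover, Dict[str, str]]


def apply_mapping(assertion: str, mapping: Dict[str, str]) -> str:
    """Rename state symbols according to a refinement mapping (textual sketch)."""
    for high, low in mapping.items():
        assertion = assertion.replace(high + "(", low + "(")
    return assertion


def hierarchical_strengthening(levels: List[Level]) -> Dict[str, List[str]]:
    """Process levels from the most abstract to the most concrete."""
    inherited: List[str] = []
    invariants: Dict[str, List[str]] = {}
    for name, prove, mapping in levels:
        inherited = [apply_mapping(a, mapping) for a in inherited]  # L!A_H
        learned = prove(name + "!Safety", inherited)                # L!A_L
        inherited = inherited + learned
        invariants[name] = [name + "!Safety"] + inherited
    return invariants


# Stub provers returning fixed assertions, standing in for real prover runs.
def prove_high(safety, inherited):
    return ["votes(A,B,V) -> isSafeAt(B,V)"]


def prove_low(safety, inherited):
    return ["msg2b(A,B,V) -> msg2a(B,V)"]


levels = [
    ("H", prove_high, {}),
    ("L", prove_low, {"votes": "msg2b"}),
]
result = hierarchical_strengthening(levels)
print(result["L"])
```

Each level only has to discover the assertions that the levels above it did not already provide, which is the sense in which the higher-level proof strengthens the lower-level safety property.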

#### V. HIERARCHICAL SPECIFICATION OF PAXOS

This section describes in detail the multi-level hierarchical structure of the Paxos protocol, as shown earlier in Figure 1.

#### *A. Lamport's Voting Protocol*

Figure 2 presents the TLA+ [11] description<sup>1</sup> of the *Voting* protocol [38], which is a very high-level abstraction of *Paxos* that formalizes the way Lamport first thought about the Paxos consensus algorithm without getting distracted by details introduced by having the processes communicate by messages. *Voting* has three unordered sorts named value, acceptor and quorum, and a totally-ordered sort named ballot. The protocol has two state symbols, votes and maxBal, defined on these sorts, that serve as the protocol's state variables. votes(a, b, v) is true iff an acceptor a has voted for value v in ballot number b. maxBal(a) returns a ballot number such that acceptor a will never cast any further vote in a ballot numbered less than maxBal(a). The global axiom (line 5) defines the elements of the quorum sort to be subsets of the acceptor sort and restricts them further by requiring them to be pair-wise non-disjoint. Lines 6-9 specify definitions chosenAt, chosen, showsSafeAt, and isSafeAt, which serve as auxiliary non-state variables. Protocol transitions are specified by the actions IncreaseMaxBal and VoteFor (lines 10-11), and lines 12-14 specify the protocol's initial states, transition relation, and safety property.

<sup>1</sup>Lamport's TLA+ encoding uses sets to denote variables. For example in [38], votes[a] represents the set of votes cast by acceptor a. Throughout this paper, we use an equivalent representation based on relations/functions to enable encoding for SMT solving. ⟨b, v⟩ ∈ votes[a] is equivalently encoded in relational form as votes(a, b, v) = ⊤.

#### MODULE *Voting*

1 CONSTANTS value, acceptor, quorum

2 ballot ≜ Nat ∪ {−1}

3 VARIABLES votes, maxBal

4 votes ∈ (acceptor × ballot × value) → BOOLEAN, maxBal ∈ acceptor → ballot

5 ASSUME

	- ∧ ∀ Q ∈ quorum : Q ⊆ acceptor
	- ∧ ∀ Q<sub>1</sub>, Q<sub>2</sub> ∈ quorum : Q<sub>1</sub> ∩ Q<sub>2</sub> ≠ {}

6 chosenAt(b, v) ≜ ∃ Q ∈ quorum : ∀ A ∈ Q : votes(A, b, v)

7 chosen(v) ≜ ∃ B ∈ ballot : chosenAt(B, v)

8 showsSafeAt(q, b, v) ≜

	- ∧ ∀ A ∈ q : maxBal(A) ≥ b
	- ∧ ∃ C ∈ ballot :
		- ∧ (C < b)
		- ∧ (C ≠ −1) → ∃ A ∈ q : votes(A, C, v)
		- ∧ ∀ D ∈ ballot : (C < D < b) → ∀ A ∈ q : ∀ V ∈ value : ¬votes(A, D, V)

9 isSafeAt(b, v) ≜ ∃ Q ∈ quorum : showsSafeAt(Q, b, v)

10 IncreaseMaxBal(a, b) ≜

	- ∧ b ≠ −1 ∧ b > maxBal(a)
	- ∧ maxBal′ = [maxBal EXCEPT ![a] = b]
	- ∧ UNCHANGED votes

11 VoteFor(a, b, v) ≜

	- ∧ b ≠ −1 ∧ maxBal(a) ≤ b
	- ∧ ∀ V ∈ value : ¬votes(a, b, V)
	- ∧ ∀ C ∈ acceptor : (C ≠ a) → ∀ V ∈ value : votes(C, b, V) → (V = v)
	- ∧ isSafeAt(b, v)
	- ∧ votes′ = [votes EXCEPT ![a, b, v] = ⊤]
	- ∧ maxBal′ = [maxBal EXCEPT ![a] = b]

12 Init ≜

	- ∧ ∀ A ∈ acceptor, B ∈ ballot, V ∈ value : ¬votes(A, B, V)
	- ∧ ∀ A ∈ acceptor : maxBal(A) = −1

13 Next ≜ ∃ A ∈ acceptor, B ∈ ballot, V ∈ value : IncreaseMaxBal(A, B) ∨ VoteFor(A, B, V)

14 Safety ≜ ∀ V<sub>1</sub>, V<sub>2</sub> ∈ value : chosen(V<sub>1</sub>) ∧ chosen(V<sub>2</sub>) → V<sub>1</sub> = V<sub>2</sub>

Viewed as a parameterized system, the template of the *Voting* protocol is *Voting*(value, acceptor, quorum, ballot). Its finite instance:

```
Voting(2, 3, 3, 4) :
   value2 ≜ {v1, v2}
   acceptor3 ≜ {a1, a2, a3}
   quorum3 ≜ {q12 :{a1, a2}, q13 :{a1, a3}, q23 :{a2, a3}}
   ballot4 ≜ [bmin, b1, b2, bmax]
```
has three finite symmetric sorts named value<sub>2</sub>, acceptor<sub>3</sub> and quorum<sub>3</sub>, defined as sets of arbitrarily-named distinct constants, while the finite totally-ordered sort ballot<sub>4</sub> is composed of a list of ordered constants, i.e., b<sub>min</sub> < b<sub>1</sub> < b<sub>2</sub> < b<sub>max</sub>, where b<sub>min</sub> = −1 since −1 is the "minimum" ballot number. The constants of the quorum<sub>3</sub> sort are subsets of the acceptor<sub>3</sub> sort and are named to reflect their symmetric dependence on the acceptor<sub>3</sub> sort. This instance has 24 votes state variables that return a BOOLEAN and 3 maxBal state variables that return a ballot number in ballot<sub>4</sub>. A *state* of this instance corresponds to a complete assignment to these 27 state variables.

#### MODULE *Paxos*

	- msg1b ∈ (acceptor × ballot × ballot × value) → BOOLEAN
	- msg2a ∈ (ballot × value) → BOOLEAN
	- msg2b ∈ (acceptor × ballot × value) → BOOLEAN

10 Phase1a(b) ≜ ∧ b ≠ −1 ∧ msg1a′ = [msg1a EXCEPT ![b] = ⊤] ∧ UNCHANGED msg1b, msg2a, msg2b, maxBal, maxVBal, maxVal

11 Phase1b(a, b) ≜ ∧ b ≠ −1 ∧ msg1a(b) ∧ b > maxBal(a)

	- ∧ b ≠ −1 ∧ v ≠ none ∧ ¬(∃ V ∈ value : msg2a(b, V))
	- ∧ msg2a′ = [msg2a EXCEPT ![b, v] = ⊤]
	- ∧ b ≠ −1 ∧ v ≠ none ∧ msg2a(b, v) ∧ b ≥ maxBal(a)
	- ∧ maxBal′ = [maxBal EXCEPT ![a] = b]
	- ∧ maxVBal′ = [maxVBal EXCEPT ![a] = b]
	- ∧ maxVal′ = [maxVal EXCEPT ![a] = v]
	- ∧ msg2b′ = [msg2b EXCEPT ![a, b, v] = ⊤]
	- ∧ UNCHANGED msg1a, msg1b, msg2a
	- ∨ Phase2a(B, V) ∨ Phase2b(A, B, V)

16 Safety ≜ ∀ V<sub>1</sub>, V<sub>2</sub> ∈ value : chosen(V<sub>1</sub>) ∧ chosen(V<sub>2</sub>) → V<sub>1</sub> = V<sub>2</sub>

Fig. 3: Lamport's Paxos protocol in pretty-printed TLA+

#### *B. Lamport's Paxos Protocol*

Figure 3 presents the TLA+ description of Lamport's *Paxos* protocol [39], which is a specification of the Paxos consensus algorithm [1], [2]. *Paxos* implements *Voting* through the refinement mapping [votes ← msg2b, maxBal ← maxBal], where acceptors now communicate with each other through distributed message passing. State variables msg1a, msg1b, msg2a, and msg2b are used to model the set of different messages that can be sent in the protocol, corresponding to actions Phase1a, Phase1b, Phase2a, and Phase2b respectively. The pair ⟨maxVBal(a), maxVal(a)⟩ is the vote with the largest ballot number cast by acceptor a. The ballot b leader can send a msg1a(b) by performing the action Phase1a(b). Phase1b(a, b) implements the IncreaseMaxBal(a, b) action from *Voting*, where after receiving msg1a(b), acceptor a sends msg1b to the ballot b leader containing the values of maxVBal(a) and maxVal(a). In the Phase2a(b, v) action, the ballot b leader sends msg2a asking the acceptors to vote for a value v that is safe at ballot number b. Its enabling condition isSafeAtPaxos(b, v) checks the enabling condition isSafeAt(b, v) from *Voting*. Phase2b implements the VoteFor action in *Voting*, and enables acceptor a to vote for value v in ballot number b. We refer the reader to [40] for a detailed explanation of the internals of *Paxos*.

Represented as a parameterized system *Paxos*(value, acceptor, quorum, ballot), its finite instance *Paxos*(2, 3, 3, 4) has 132 BOOLEAN state variables, 6 state variables that return a ballot number in ballot<sub>4</sub>, and 3 state variables that return a value in value<sub>2</sub>.
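These counts can be reproduced from the variable signatures alone. The short sketch below multiplies out the argument sorts of each state symbol for the instance sizes (2, 3, 3, 4). The signature of msg1a (ballot → BOOLEAN) is read off the Phase1a action, and the bit count assumes a binary encoding with ⌈log<sub>2</sub> n⌉ bits for a variable ranging over a sort of size n; that encoding is an assumption of this sketch, chosen because it agrees with the 147 state bits quoted in Section VII-E.

```python
from math import ceil, log2

SIZES = {"value": 2, "acceptor": 3, "quorum": 3, "ballot": 4}

# name -> (argument sorts, result), where result is "bool" or a finite sort.
VOTING_VARS = {
    "votes":  (["acceptor", "ballot", "value"], "bool"),
    "maxBal": (["acceptor"], "ballot"),
}
PAXOS_VARS = {
    "msg1a":   (["ballot"], "bool"),
    "msg1b":   (["acceptor", "ballot", "ballot", "value"], "bool"),
    "msg2a":   (["ballot", "value"], "bool"),
    "msg2b":   (["acceptor", "ballot", "value"], "bool"),
    "maxBal":  (["acceptor"], "ballot"),
    "maxVBal": (["acceptor"], "ballot"),
    "maxVal":  (["acceptor"], "value"),
}


def count(variables):
    """Return (number of state variables per result kind, total state bits)."""
    counts = {"bool": 0, "ballot": 0, "value": 0}
    bits = 0
    for args, result in variables.values():
        instances = 1
        for sort in args:
            instances *= SIZES[sort]
        counts[result] += instances
        width = 1 if result == "bool" else ceil(log2(SIZES[result]))
        bits += instances * width
    return counts, bits


voting_counts, voting_bits = count(VOTING_VARS)
paxos_counts, paxos_bits = count(PAXOS_VARS)
print("Voting:", voting_counts, voting_bits, "bits")
print("Paxos: ", paxos_counts, paxos_bits, "bits")
```

Running it gives 24 Boolean and 3 ballot-valued variables for *Voting*, and 132 Boolean, 6 ballot-valued and 3 value-valued variables for *Paxos*, matching the figures stated in the text.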

#### *C. Intermediate Levels between Voting and Paxos*

We introduced two intermediate levels, *SimplePaxos* and *ImplicitPaxos*, between *Voting* and *Paxos*. These intermediate levels are abstractions of *Paxos*, inspired by the existing literature [12], [41]–[44]. *ImplicitPaxos* is inspired by the specification of Generalized Paxos by Lamport [41] and uses a commonly-used encoding transformation, as utilized in [12], [43], [44]. Instead of explicitly keeping track of maxVBal(a) and maxVal(a), *ImplicitPaxos* abstracts them away and implicitly computes their respective values using the history of all votes cast by the acceptor a, i.e., using the history of msg2b from acceptor a, by modifying the Phase1b(a, b) action (line 11 in Figure 3) as shown in Figure 4.

*SimplePaxos* further simplifies *ImplicitPaxos* and completely eliminates from msg1b the tracking of the maximum ballot (and the corresponding value) in which an acceptor voted, i.e., the last two arguments of msg1b are abstracted away. Instead, the history of all votes cast is used to describe how new votes are cast. This is done by replacing the definition showsSafeAtPaxos (line 8 in Figure 3) with its simplified form, expressed using msg2b as shown in Figure 5.

#### MODULE *ImplicitPaxos*

11 Phase1b(a, b) ≜

	- ∧ b ≠ −1 ∧ msg1a(b) ∧ b > maxBal(a)
	- ∧ maxBal′ = [maxBal EXCEPT ![a] = b]
	- ∧ ∃ M<sub>b</sub> ∈ ballot : ∃ M<sub>v</sub> ∈ value :
		- ∧ ( (M<sub>b</sub> = −1) ∧ ∀ B ∈ ballot : ∀ V ∈ value : ¬msg2b(a, B, V) ) ∨ ( (M<sub>b</sub> ≠ −1) ∧ msg2b(a, M<sub>b</sub>, M<sub>v</sub>) ∧ ∀ B ∈ ballot : ∀ V ∈ value : msg2b(a, B, V) → B ≤ M<sub>b</sub> )
		- ∧ msg1b′ = [msg1b EXCEPT ![a, b, M<sub>b</sub>, M<sub>v</sub>] = ⊤]
	- ∧ UNCHANGED msg1a, msg2a, msg2b

#### VI. HIERARCHICAL VERIFICATION OF PAXOS

Using the 4-level hierarchy *Paxos* ≺ *ImplicitPaxos* ≺ *SimplePaxos* ≺ *Voting*, this section is a "log" of how IC3PO automatically derived the required strengthening assertions that established the safety of *Paxos*.

#### *A. Proving Voting*

Using instance *Voting*(2, 3, 3, 4), IC3PO proved the safety of *Voting* by automatically deriving the inductive invariant V!inv ≜ V!Safety ∧ V!A<sub>1</sub> ∧ V!A<sub>2</sub>, where

$$\begin{aligned}
V!A_1 &= \forall A \in \mathsf{acceptor}, B \in \mathsf{ballot}, V \in \mathsf{value}: \\
&\quad votes(A, B, V) \to isSafeAt(B, V) \\
V!A_2 &= \forall A \in \mathsf{acceptor}, B \in \mathsf{ballot}, V_1, V_2 \in \mathsf{value}: \\
&\quad chosenAt(B, V_1) \land votes(A, B, V_2) \to (V_1 = V_2)
\end{aligned}$$

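In words, V!A<sub>1</sub> states that if an acceptor voted for value V in ballot B, then V is safe at B, and V!A<sub>2</sub> states that if value V<sub>1</sub> is chosen at ballot B, then no acceptor voted for a different value in B.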
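As a sanity check on these two assertions, the following sketch explores all reachable states of a small *Voting* instance by brute force and tests Safety, A<sub>1</sub> and A<sub>2</sub> in each of them. It is not how IC3PO works (IC3PO uses incremental induction with SMT queries), and it only checks that the assertions hold on reachable states, not that their conjunction is inductive. To keep the search small, the instance uses three ballots (−1, 0, 1) instead of four; that choice, and all Python names, are assumptions of this sketch.

```python
from itertools import combinations

VALUES = (0, 1)
ACCEPTORS = (0, 1, 2)
BALLOTS = (-1, 0, 1)                         # -1 plays the role of b_min
QUORUMS = tuple(combinations(ACCEPTORS, 2))  # pairwise-intersecting majorities


def chosen_at(votes, b, v):
    return any(all((a, b, v) in votes for a in q) for q in QUORUMS)


def shows_safe_at(votes, max_bal, q, b, v):
    if not all(max_bal[a] >= b for a in q):
        return False
    for c in BALLOTS:
        if c >= b:
            continue
        voted = c == -1 or any((a, c, v) in votes for a in q)
        quiet = all((a, d, w) not in votes
                    for d in BALLOTS if c < d < b
                    for a in q for w in VALUES)
        if voted and quiet:
            return True
    return False


def is_safe_at(votes, max_bal, b, v):
    return any(shows_safe_at(votes, max_bal, q, b, v) for q in QUORUMS)


def successors(state):
    votes, max_bal = state
    for a in ACCEPTORS:
        for b in BALLOTS:
            if b == -1:
                continue
            bumped = max_bal[:a] + (b,) + max_bal[a + 1:]
            if b > max_bal[a]:                      # IncreaseMaxBal(a, b)
                yield (votes, bumped)
            for v in VALUES:                        # VoteFor(a, b, v)
                if (max_bal[a] <= b
                        and not any((a, b, w) in votes for w in VALUES)
                        and all(w == v for (c, d, w) in votes if d == b and c != a)
                        and is_safe_at(votes, max_bal, b, v)):
                    yield (votes | {(a, b, v)}, bumped)


def holds(state):
    votes, max_bal = state
    chosen = {v for b in BALLOTS for v in VALUES if chosen_at(votes, b, v)}
    safety = len(chosen) <= 1
    a1 = all(is_safe_at(votes, max_bal, b, v) for (_, b, v) in votes)
    a2 = all(v1 == v2 for (_, b, v2) in votes
             for v1 in VALUES if chosen_at(votes, b, v1))
    return safety and a1 and a2


def check():
    init = (frozenset(), (-1,) * len(ACCEPTORS))
    seen, frontier = {init}, [init]
    while frontier:
        state = frontier.pop()
        if not holds(state):
            return False, len(seen)
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return True, len(seen)


ok, n_states = check()
print("Safety, A1 and A2 hold on all", n_states, "reachable states:", ok)
```

Such an exhaustive check is only feasible for very small instances, which is one reason symbolic techniques are needed for sizes like *Voting*(2, 3, 3, 4) and beyond.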


#### *B. Proving SimplePaxos*

Using the refinement mapping [votes ← msg2b, maxBal ← maxBal], IC3PO transformed V!A<sub>1</sub> and V!A<sub>2</sub> to the following corresponding versions for *SimplePaxos*:

$$\begin{aligned}
S!A_1 &= \forall A \in \mathsf{acceptor}, B \in \mathsf{ballot}, V \in \mathsf{value}: \\
&\quad msg2b(A, B, V) \to isSafeAt(B, V) \\
S!A_2 &= \forall A \in \mathsf{acceptor}, B \in \mathsf{ballot}, V_1, V_2 \in \mathsf{value}: \\
&\quad chosenAt(B, V_1) \land msg2b(A, B, V_2) \to (V_1 = V_2)
\end{aligned}$$

These two assertions, passed down from the proof of *Voting*, represented a strengthening of the safety property of *SimplePaxos* that allowed IC3PO to prove it with the inductive invariant S!inv ≜ S!Safety ∧ ⋀<sub>1≤i≤6</sub> S!A<sub>i</sub>, where

$$\begin{aligned}
S!A_3 &= \forall B \in \mathsf{ballot}, V \in \mathsf{value}: \\
&\quad msg2a(B, V) \to isSafeAt(B, V) \\
S!A_4 &= \forall B \in \mathsf{ballot}, V_1, V_2 \in \mathsf{value}: \\
&\quad msg2a(B, V_1) \land msg2a(B, V_2) \to (V_1 = V_2) \\
S!A_5 &= \forall A \in \mathsf{acceptor}, B \in \mathsf{ballot}, V \in \mathsf{value}: \\
&\quad msg2b(A, B, V) \to msg2a(B, V) \\
S!A_6 &= \forall A \in \mathsf{acceptor}, B \in \mathsf{ballot}: \\
&\quad msg1b(A, B) \to maxBal(A) \ge B
\end{aligned}$$

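are four additional automatically-generated strengthening assertions about *SimplePaxos*. Read directly from the formulas: S!A<sub>3</sub> states that a value proposed in a 2a message for ballot B is safe at B; S!A<sub>4</sub> states that at most one value is proposed per ballot; S!A<sub>5</sub> states that an acceptor votes only for a value that was proposed in that ballot; and S!A<sub>6</sub> states that an acceptor that sent a 1b message for ballot B has maxBal at least B.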


#### *C. Proving ImplicitPaxos*

All variables from *SimplePaxos* refine to *ImplicitPaxos* as is, except for msg1b, which adds explicit tracking of the maximum vote cast by an acceptor in *ImplicitPaxos*. Assertions S!A<sub>1</sub> to S!A<sub>5</sub> map to I!A<sub>1</sub> to I!A<sub>5</sub> in *ImplicitPaxos* as is, while S!A<sub>6</sub> maps as:

$$\begin{aligned}
I!A_6 &= \forall A \in \mathsf{acceptor}, B, B_{max} \in \mathsf{ballot}, V_{max} \in \mathsf{value}: \\
&\quad msg1b(A, B, B_{max}, V_{max}) \to maxBal(A) \ge B
\end{aligned}$$

These six assertions, passed down from the proof of *SimplePaxos*, represented a strengthening of the safety property of *ImplicitPaxos* that allowed IC3PO to prove it with the inductive invariant I!inv ≜ I!Safety ∧ ⋀<sub>1≤i≤8</sub> I!A<sub>i</sub>, where

$$\begin{aligned}
I!A_7 &= \forall A \in \mathsf{acceptor}, B, B_{max} \in \mathsf{ballot}, V_{max} \in \mathsf{value}: \\
&\quad [\, (B > -1) \land (B_{max} > -1) \land msg1b(A, B, B_{max}, V_{max}) \,] \\
&\quad \to msg2b(A, B_{max}, V_{max}) \\
I!A_8 &= \forall A \in \mathsf{acceptor}, B, B_{mid}, B_{max} \in \mathsf{ballot}, V, V_{max} \in \mathsf{value}: \\
&\quad [\, (B > B_{mid}) \land (B_{mid} > B_{max}) \land msg1b(A, B, B_{max}, V_{max}) \,] \\
&\quad \to \neg msg2b(A, B_{mid}, V)
\end{aligned}$$

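are two additional automatically-generated strengthening assertions about *ImplicitPaxos*. Read directly from the formulas: I!A<sub>7</sub> states that when a 1b message reports a maximum vote ⟨B<sub>max</sub>, V<sub>max</sub>⟩ with B<sub>max</sub> > −1, the acceptor did cast that vote; and I!A<sub>8</sub> states that the acceptor cast no vote in any ballot strictly between B<sub>max</sub> and B.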


#### *D. Proving Paxos*

All variables from *ImplicitPaxos* refine to *Paxos* trivially, mapping I!A<sub>1</sub>, . . . , I!A<sub>8</sub> to P!A<sub>1</sub>, . . . , P!A<sub>8</sub> in *Paxos* as is. These eight assertions, passed down from the proof of *ImplicitPaxos*, represented a strengthening of the safety property of *Paxos* that allowed IC3PO to prove it with the inductive invariant P!inv ≜ P!Safety ∧ ⋀<sub>1≤i≤11</sub> P!A<sub>i</sub>, where

$$\begin{aligned}
P!A_9 &= \forall A \in \mathsf{acceptor}: maxVBal(A) \le maxBal(A) \\
P!A_{10} &= \forall A \in \mathsf{acceptor}, B \in \mathsf{ballot}, V \in \mathsf{value}: \\
&\quad msg2b(A, B, V) \to maxVBal(A) \ge B \\
P!A_{11} &= \forall A \in \mathsf{acceptor}: \\
&\quad maxVBal(A) > -1 \to msg2b(A, maxVBal(A), maxVal(A))
\end{aligned}$$

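are three additional automatically-generated strengthening assertions about *Paxos*. Read directly from the formulas: P!A<sub>9</sub> states that the ballot of an acceptor's latest vote never exceeds its maxBal; P!A<sub>10</sub> states that maxVBal(A) is at least as large as every ballot in which A voted; and P!A<sub>11</sub> states that, once A has voted, ⟨maxVBal(A), maxVal(A)⟩ is a vote that A actually cast.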


#### VII. DISCUSSION

This section discusses key features of the *Paxos* proof from Section VI.

#### *A. Comparison against Human-written Invariants*

Optionally, the inductive invariant P!inv can be minimized to derive a subsumption-free and closed set of invariants, which removes A<sub>1</sub> and A<sub>2</sub>, since they are subsumed by the conjunction A<sub>3</sub> ∧ A<sub>4</sub> ∧ A<sub>5</sub>. After this minimization, the inductive invariant of *Paxos* matches the manually-written and TLAPS-checked inductive invariant from [28] exactly, guaranteeing its correctness. Similarly, the inductive invariant of *Voting*, i.e., V!inv, matches directly with the manually-written and TLAPS-checked inductive invariant from [45].
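One informal way to see why A<sub>1</sub> and A<sub>2</sub> become redundant is to chain the implications of A<sub>3</sub>, A<sub>4</sub> and A<sub>5</sub>. The derivation below is a reader's sketch rather than the minimization procedure itself; it assumes that chosenAt is read over msg2b (under the refinement mapping votes ← msg2b) and that every quorum is non-empty, which follows from the pairwise-intersection axiom by taking Q<sub>1</sub> = Q<sub>2</sub>.

$$\begin{aligned}
& msg2b(A, B, V) \;\xrightarrow{A_5}\; msg2a(B, V) \;\xrightarrow{A_3}\; isSafeAt(B, V) && \text{(hence } A_1\text{)} \\
& chosenAt(B, V_1) \;\to\; \exists A' : msg2b(A', B, V_1) \;\xrightarrow{A_5}\; msg2a(B, V_1) \\
& msg2b(A, B, V_2) \;\xrightarrow{A_5}\; msg2a(B, V_2) \\
& msg2a(B, V_1) \land msg2a(B, V_2) \;\xrightarrow{A_4}\; V_1 = V_2 && \text{(hence } A_2\text{)}
\end{aligned}$$

The first line gives A<sub>1</sub> directly; the remaining three lines combine to give A<sub>2</sub>.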

#### *B. Benefits of Range Boosting*

Assertions A<sub>6</sub> to A<sub>11</sub> express conditions defined over ordered ranges in the *infinite* totally-ordered ballot domain. Inferring such invariants automatically through IC3PO becomes possible through range boosting (Section III), which extends incremental induction with the knowledge of *temporal regularity* over totally-ordered domains by learning quantified clauses over ordered ranges.

#### *C. Protocol's Formula Structure*

Note that A<sub>1</sub> to A<sub>3</sub> use definitions isSafeAt and chosenAt, which implicitly enables IC3PO to incorporate learning with complex quantifier alternations. Inspired by previous works on the importance of using derived/ghost variables [36], [46], [47], IC3PO utilizes the *formula structure* of the protocol's transition relation in a unique manner, by incorporating *definitions* in the protocol specification as auxiliary non-state variables during reachability analysis, described in detail in [27]. This provides a simple and inexpensive procedure to incorporate clause learning with complex quantifier alternations.

#### *D. Decidability*

Protocol specifications at each of the four levels include quantifier alternation cycles that make unbounded SMT reasoning fall into the undecidable fragment of first-order logic. Unsurprisingly, previous works that rely on unbounded SMT reasoning, like SWISS [48], fol-ic3 [49], DistAI [50], I4 [51], and UPDR [52], struggle with verifying Lamport's Paxos. IC3PO, on the other hand, performs incremental induction and finite convergence over finite protocol instances using finite-domain reasoning that is always decidable.

#### *E. Why a Four-Level Hierarchy?*

The original Paxos specification is composed of a two-level hierarchy *Paxos* ≺ *Voting*. Given the two strengthening assertions A<sub>1</sub> and A<sub>2</sub> from *Voting*, inferring the remaining nine assertions for *Paxos* directly in one step of hierarchical strengthening is difficult, since these two specifications are too far apart to be proved directly. IC3PO struggled with the large state space of *Paxos* and learnt too many weak clauses involving msg1b, maxVBal and maxVal, eventually running out of memory due to invariant inference getting confused with several counterexamples-to-induction. Table I compares the state-space size of protocol instances at each of the four hierarchical levels. Even though 2<sup>147</sup> is not huge, especially with respect to hardware verification problems [53]–[55], *Paxos* has a dense state-transition graph where state transitions are tightly coupled with high in- and out-degree, making the problem difficult for automatic invariant inference with incremental induction based model checking.

Adding *ImplicitPaxos* reduced the complexity in *Paxos* by abstracting away maxVBal and maxVal. Still, scalability remained a challenge due to msg1b, which contributed 96 out of 147 state bits in *Paxos*(2, 3, 3, 4). Adding another level, i.e., *SimplePaxos*, removed 84 out of these 96 state bits by abstracting away explicit tracking of the maximum vote of an acceptor from msg1b. When compared against *Paxos*, *SimplePaxos* is significantly simpler, with a total state-space size of just 2<sup>54</sup> for its finite instance *SimplePaxos*(2, 3, 3, 4), which led IC3PO to successfully prove *Paxos* automatically using the four-level hierarchy.

TABLE I: State-space size for finite instances with 2 value, 3 acceptor, 3 quorum, and 4 ballot

#### *F. Extension to MultiPaxos and FlexiblePaxos*

So far, by *Paxos* we meant *single-decree* Paxos, which is the core consensus algorithm underlying the complete Paxos state-machine replication protocol [1], [2], commonly referred to as *MultiPaxos* [43]. In *MultiPaxos*, a sequence of instances execute single-decree *Paxos* such that the value chosen in the i<sup>th</sup> instance becomes the i<sup>th</sup> command executed by the replicated state machine. Additionally, if the leader is relatively stable, Phase1 becomes unnecessary and is skipped, reducing the failure-free message delay from 4 delays to 2 delays.

Mapping each of the assertions A<sub>1</sub>, . . . , A<sub>11</sub> to *MultiPaxos* is trivial, and simply adds the corresponding instance as an additional universally-quantified argument, e.g., A<sub>11</sub> maps as:

$$\begin{aligned}
M!A_{11} &= \forall A \in \mathsf{acceptor}, I \in \mathsf{instances}: \\
&\quad maxVBal(A, I) > -1 \\
&\quad \to msg2b(A, I, maxVBal(A, I), maxVal(A, I))
\end{aligned}$$

Unsurprisingly, the 11 strengthening assertions, passed down from the proof of *Paxos*, together with the safety property of *MultiPaxos*, allowed IC3PO to trivially prove it with no additional strengthening assertions needed, meaning M!Safety ∧ ⋀<sub>1≤i≤11</sub> M!A<sub>i</sub> is already an inductive invariant of *MultiPaxos*. As described in previous works [1], [2], [6], [10], the crux of proving the safety of *MultiPaxos* is based on proving single-decree *Paxos* since each consensus instance participates independently without any interference from other instances. Our experiments validated this further.

Similarly, we also tried another Paxos variant called *FlexiblePaxos* [56], which also verifies trivially with the same inductive invariant, i.e., with no additional strengthening assertions needed.

#### VIII. EXPERIMENTS

IC3PO [57] currently accepts protocol descriptions in the Ivy language [13] and uses the Ivy compiler to extract a logical formulation of the protocol in an SMT-LIB [30] compatible format. To get an idea of the effectiveness of hierarchical strengthening, we also evaluated automatically deriving inductive proofs for EPR variants of Paxos from [12] without any hierarchical strengthening. These specifications describe Paxos in the EPR fragment [14] of first-order logic and also incorporate simplifications equivalent to the ones described for *SimplePaxos* in Section V-C. We performed a detailed comparison against other state-of-the-art techniques for automatically verifying distributed protocols: SWISS [48], fol-ic3 [49], DistAI [50], I4 [51], and UPDR [52].



TABLE II: Comparison of IC3PO against other state-of-the-art verifiers

ORIGINAL problems employ hierarchical strengthening (as detailed in Section VI), while EPR problems do not. Column 2 (labeled S.A.) lists strengthening assertions added through hierarchical strengthening to the safety property (∅ means none). Columns 3-8 (labeled Time) compare the runtime in seconds. For failed SWISS runs, we include the runtime from [48] (indicated with <sup>∗</sup>). Columns 9-10 (labeled Inv) compare the number of assertions in the inductive invariant between IC3PO (with subsumption checking and minimization) and human-written proofs.

Columns 11-12 (labeled SMT) compare total number of SMT queries made by IC3PO versus I4 (until failure for unsuccessful runs).

ative search for a quantified separator in the space of bounded mixed quantifier prefixes.


All experiments were performed on an Intel (R) Xeon CPU (X5670). For each run, we used a 5-hour timeout and a 32 GB memory limit. All tools were executed in their respective default configurations. We used Z3 [62] version 4.8.10, Yices 2 [63] version 2.6.2, and CVC4 [64] version 1.8.

#### *A. Results*

Table II summarizes the experimental results. EPR variants were run without any hierarchical strengthening. For ORIGINAL problems, we employed hierarchical strengthening using each tool to verify Lamport's original Paxos specification (and its variants) through higher-level strengthening assertions that were automatically generated by IC3PO (as detailed in Section VI). Note that ORIGINAL problems include quantifier-alternation cycles that make unbounded SMT reasoning fall into the undecidable fragment of first-order logic.

IC3PO emerges as the only successful technique that verifies Lamport's Paxos and its variants, and automatically infers the required inductive invariants efficiently. Unsurprisingly, none of the other tools (i.e., SWISS, fol-ic3, DistAI, I4 and UPDR) were able to solve ORIGINAL problems, since each of these tools relies on unbounded SMT reasoning and struggles on problems that fall outside the decidable EPR fragment of first-order logic.

#### *B. Discussion*

*Effect of hierarchical strengthening:* Comparing EPR versus ORIGINAL shows the advantages offered by hierarchical strengthening. Even though IC3PO was able to automatically verify EPR versions of single-decree Paxos and flexible Paxos from [12], none of the tools were able to automatically verify the EPR version of multi-decree Paxos. ORIGINAL variants, on the other hand, employed hierarchical strengthening, which allowed IC3PO to verify Lamport's Paxos automatically and efficiently by using the protocol's hierarchical structure.

*Comparison against other verifiers:* DistAI failed on all problems due to unsupported constructs and parsing errors. I4 and UPDR (as well as DistAI) are limited to generating only universally-quantified invariants over state variables, and hence, were unable to solve any problem. While both IC3PO and I4 use incremental induction over a finite protocol instance, the number of SMT queries made by I4 grows drastically, indicating the benefits offered by symmetry and range boosting employed in IC3PO. fol-ic3 also fails on all problems, showing limited scalability of its enumeration-based separators technique operating directly in the unbounded domain. For SWISS, we were not able to replicate the results for EPR problems reported in [48] using our experimental setup. Nevertheless, SWISS showed limited capabilities for solving ORIGINAL problems.

*Comparison against human-written invariants:* As evident from A<sub>1</sub> to A<sub>11</sub> in Section VI, IC3PO generated concise, human-readable inductive invariants. In fact, every invariant of *Paxos* written manually by Lamport et al. (as detailed in [28], [39]) had a corresponding equivalent invariant in the inductive proof automatically generated with IC3PO. In contrast, deriving such invariants manually, even in the presence of a hierarchical structure, is a tedious and error-prone process that demands deep domain expertise [12], [16], [28], [29].

Overall, the evaluation confirms our main hypothesis, that it is possible to utilize the regularity and hierarchical structure in complex distributed protocols, like in Paxos, to scale automatic verification beyond the current state-of-the-art.

#### IX. RELATED WORK

Introduced by Lamport, TLA+ is a widely-adopted language for the specification and verification of distributed protocols [65], [66]. The TLA+ toolbox [67] provides the TLC model checker, which is primarily used as a debugging tool for verifying small finite protocol instances [68], and not as a tool for inferring inductive invariants. The TLAPS proof assistant [7], [8] allows checking proofs manually written in TLA+, and has been used to verify several distributed protocols, including variants of Paxos [10], [15].

The derivation of inductive invariants for distributed protocols continues to be mostly carried out through refinement proofs using interactive theorem proving [13], [16], [17], [19], [69]–[72], which demands significant manual effort and profound domain expertise. The first attempts at automatically deriving quantified invariants were reported in [32], [33], using *invisible invariants*. The intuition underlying this method was the assumption that the system is "sufficiently symmetric," and that its behavior can be captured by any m-subset of its processes as a universally-quantified invariant. However, universally-quantified invariants are not guaranteed to be inductive or to imply the safety property. Spatial regularity was further explored in [73]–[78] to reduce the verification of an n-process system to that of a *quotient* system at a small *cutoff* size.

Notwithstanding the undecidability result of Apt and Kozen [79], many efforts to automatically infer quantified inductive invariants have been reported with the pace increasing in recent years [48], [50]–[52], [80]–[82]. Verification of parameterized systems is further explored in [83]–[87]. However, unlike IC3PO, these methods generally do not scale to complex protocols like Lamport's Paxos, since these methods rely heavily on unbounded reasoning and are limited to specifications in the EPR fragment of first-order logic.

Our technique builds on these works, with the capability to automatically infer the required quantified inductive invariant using the latest advancements in model checking, by extending our recent work [27] on symmetry boosting and finite convergence with range boosting and hierarchical strengthening.

#### X. CONCLUSIONS & FUTURE WORK

We proposed *range boosting*, a novel technique that extends the incremental induction algorithm to utilize the temporal regularity in distributed protocols through quantified reasoning over ordered ranges. We also presented *hierarchical strengthening*, a simple technique that utilizes the hierarchical structure of protocol specifications to enable automatic verification of complex distributed protocols with high scalability. Given the four-level hierarchy of the Paxos specification, we showed that these techniques, coupled with our recent work on symmetry boosting and finite convergence, provide, to our knowledge, the first demonstration of an automatically-inferred inductive invariant for the original Lamport's Paxos algorithm.

While introducing *SimplePaxos* and *ImplicitPaxos* to get the four-level Paxos hierarchy was quite easy, these intermediate levels were still added manually. It is appealing to explore counterexample-guided abstraction-refinement (CEGAR) techniques [88], [89] to automatically identify these intermediate levels whenever needed to overcome complexity. Specifically, investigating how to leverage clause learning feedback from incomplete runs to identify bottlenecks in proof inference and utilizing this information to automatically abstract away irrelevant details from the low-level protocol can help in making the complete procedure automatic end-to-end. We leave this investigation as future work.

Exploring inference with existential quantifiers in range boosting can also be an interesting future direction, though intuitively, existential quantification over temporal behaviors looks unnecessary for proving safety properties. Future work also includes automatically inferring inductive proofs for other distributed protocols, such as Byzantine Paxos [15], Raft [90], etc., and exploring the verification of consensus algorithms in blockchain applications.

# DATA AVAILABILITY STATEMENT AND ACKNOWLEDGMENTS

The software and data sets generated and analyzed during the current study, including all experimental data, evaluation scripts, and IC3PO source code, are available at https://github.com/aman-goel/fmcad2021exp.

We thank Leslie Lamport for the TLA+ video course [91], which shaped several ideas presented in this paper. We thank the developers of TLA+ [92], [93], Yices [63], Z3 [62], pySMT [94], and Ivy [13] for making their tools openly available. We also thank the reviewers for their valuable comments.

#### REFERENCES


# Refinement-Based Verification of Device-to-Device Information Flow

Ning Dong, Roberto Guanciale, Mads Dam

KTH Royal Institute of Technology

*Abstract*—I/O devices are the critical components that allow a computing system to communicate with the external environment. From the perspective of a device, interactions can be divided into two parts: with the processor (mainly memory operations by the driver) and, through the communication medium, with external devices. In this paper, we present an abstract model of I/O devices and their drivers to describe the expected results of their execution, where the communication between devices is made explicit and the device-to-device information flow is analyzed. In order to handle general I/O functionalities, both half-duplex (transmission and reception) and full-duplex (sending and receiving simultaneously) data transmissions are considered. We propose a refinement-based approach that concretizes a correct-by-construction abstract model into an actual hardware device and its driver. As an example, we formalize the Serial Peripheral Interface (SPI) with a driver. In the HOL4 interactive theorem prover, we verified the refinement between these models by establishing a weak bisimulation. We show how this result can be used to establish both functional correctness and information flow security, both for single devices and for devices connected in an end-to-end fashion.

*Index Terms*—Formal verification, Refinement, Serial interface, Device driver, Interactive theorem prover, Information flow

#### I. INTRODUCTION

I/O devices are indispensable components for interactions with the external environment (e.g., printing documents, transmitting data, and receiving user commands). Their proper operation is critical for trustworthiness: poorly written device drivers are the predominant reason for operating system crashes [1]–[3], and devices themselves can be vulnerable to side-channel attacks [4], [5].

Existing work [6]–[10] mostly focuses on the verification of functional properties of device drivers, by analyzing the interactions between the controlling software and the I/O device. In this paper, we present a verification approach that includes inter-device communication. This allows us to establish end-to-end information flow properties, for example to guarantee the absence of side channels.

Our strategy is based on refinement. First we define a formal "concrete" model of a specific I/O device, which formalizes the device behavior that is observable by the controlling software and other external devices, and a model of its device driver. The combination of these two models provides a software/hardware subsystem that can interact with other software components and external devices. We then define an abstract model of this subsystem, which is independent of the actual device and provides a general blueprint of the subsystem's desired behavior and information flows. The goal is that this abstract model should provide a functionality that is correct and secure by construction, similar to ideal models used in cryptography. Our refinement establishes a weak bisimulation between the concrete and abstract systems.

This work has been supported by the TrustFull project funded by the Swedish Foundation for Strategic Research. Ning Dong is supported by the China Scholarship Council for his doctoral studies.

There are three main benefits of this approach:


We choose the Serial Peripheral Interface (SPI) as the demonstrating example, and we provide the formal model of a specific device, the Texas Instruments McSPI device used in the AM335x family of processors [12], and its driver. The Serial Peripheral Interface is a synchronous protocol for serial communication that is mainly used in embedded devices. The protocol was first introduced in the late 1970s by Motorola and has become popular because of its simplicity and speed [13]. SPI devices support both half-duplex and full-duplex data transmissions, where the latter is used to improve performance by simultaneously sending and receiving data with external devices. Although full-duplex is effective in practice, this is to our knowledge the first example of verification in the literature of a full-duplex communication device, cf. [6]–[10].

Fig. 1. The architecture of a random number generator


We use the refinement to establish several interesting properties of the system: (1) The driver never leads the device to enter a configuration that is undocumented by the hardware specification; (2) Two interconnected SPI subsystems correctly and securely exchange data when they are activated by their controlling software; (3) Communications (driver-to-device and device-to-device) provide progress-sensitive noninterference at both concrete and abstract levels. The latter is established by a notion of contextual indistinguishability derived from the weak bisimulation.

To demonstrate our results, we developed the demonstrator of Figure 1. We use a BeagleBone Black running the verified Prosper hypervisor [14] together with an Arducam Shield Mini 2MP Plus camera to capture a physical source of randomness for, in our case, the Verificatum e-voting system [15]. The two devices communicate using SPI. The verification allowed us to slim down the driver by removing some unnecessary device register operations. The driver model is a direct manual translation of the driver binary. Formalization of this step is left as future work. In Section X, we discuss our approach to automate this step by establishing a bisimulation between the driver model and its binary.

All proofs and models have been formalized in the HOL4 interactive theorem prover [16], which supports specification and proof in classical higher-order logic. For full definitions and proofs, we refer the reader to https://github.com/kth-step/sw-spi-cam-model/releases/tag/fmcad.

#### II. BACKGROUND

In this work, we model one of the devices of the BeagleBone Black. This is a widely used development board with multiple peripherals, including SPI, I2C, UART, etc. The board has a TI AM335x processor [12] that uses the 32-bit ARMv7 instruction set architecture.

We focus on the SPI subsystem. Figure 2 shows the basic components involved in the SPI protocol: hardware connection, a controller, and a peripheral. In full-duplex mode, SPI permits transmitting and receiving data simultaneously on separate data lines, SDI (Serial Data In) and SDO (Serial Data Out). The SPI controller uses the serial clock (SCK) line to maintain synchronization with the peripheral device. During each SPI clock cycle, from the controller's perspective, one bit is transmitted from the controller to the peripheral on the SDO line, while the peripheral sends one bit to the controller on the SDI line. In half-duplex SPI transmissions, only one data line is used, depending on the controller settings. In transmission-only mode, only the SDO data line is used, and in reception-only mode only the SDI line. The controller uses the chip select (CS) line to choose the desired communicating peripheral when multiple peripherals are connected. In this paper, we consider only the single peripheral case; extension to multiple peripherals is straightforward.

Bit transmission on the SDO/SDI lines is governed by the controller clock signal SCK, depending on configuration (clock polarity and edge settings). The SPI protocol can transmit messages of normally up to 16 bits, and delegates all error detection, flow control, and application adaptation to higher-layer protocols. A driver can interact with the SPI hardware by register polling, interrupts, and Direct Memory Access (DMA). In this work, we rely on polling only. The following registers of the BeagleBone SPI controller are the ones used for polling:

Fig. 2. Basic SPI connection: a controller and peripheral


### III. ARCHITECTURAL MODEL

We model devices and drivers as labelled transition systems (LTS) in the style of CCS [17], modelling the interaction between software and driver, driver and device, as well as between devices (through signals "on the wire") by the simultaneous occurrence of an action $\alpha$ and its dual $\overline{\alpha}$, where $\alpha, \overline{\alpha} \in \Delta_{wr} \cup \Delta_{rd} \cup \Delta_{dev} \cup \Delta_{dr}$. The top components of Figure 3 summarize the interfaces among models. Here, $\Delta_{wr}$ is the set of write operations by the CPU, each represented by the action $wt\ a\ v$ for writing a byte $v$ to the register with the memory-mapped address $a$, and the dual action $\overline{wt}\ a\ v$ that is the corresponding action of the device. Similarly, $\Delta_{rd}$ is the set of read operations by the CPU, each represented by the action $rd\ a\ v$ for reading $v$ from the register mapped at address $a$, and the dual action $\overline{rd}\ a\ v$. Representing this interaction as a CCS-style synchronous rendezvous allows us to reflect the potential side effects of register accesses on the SPI hardware. In the terminology of the π-calculus [18], we use the "early" semantics. For instance, the reading of a memory-mapped register by the CPU non-deterministically spawns one transition for every possible resulting value.

Fig. 3. The model architecture of SPI subsystem and abstract model

The device model uses four additional types of action to model device-to-device interactions on the wire. The convention needs to take controller/peripheral asymmetry into account. For transmission-only mode the controller uses $tx\ v$ to send a byte $v$ over the wire, and in reception-only mode $\overline{tx}\ v$ to receive a byte from the wire. For synchronous transfer of the (controller) byte $v$ and (peripheral) byte $v'$, the controller uses $xfer\ v\ v'$. The peripheral uses the dual actions, i.e., $\overline{tx}\ v$ ($tx\ v$) for reception (transmission) and always $\overline{xfer}\ v\ v'$ for synchronous transfer. Let $\Delta_{dev} = \{tx\ v,\ \overline{tx}\ v,\ xfer\ v\ v',\ \overline{xfer}\ v\ v' \mid \text{bytes } v, v'\}$. Finally, the driver uses four additional actions to model invocations of the driver API by application SW and one additional action for returning control and result to SW (collected by $\Delta_{dr}$).

The SPI subsystem consists of the SPI hardware running in parallel with its device driver, with the internal communication channels (e.g., $rd\ a\ v$) made inaccessible to the external world. In CCS parlance this is $(d \mid s) \setminus (\Delta_{wr} \cup \Delta_{rd})$, where $d$ and $s$ are states of the driver and hardware, respectively.
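The parallel composition with restriction described above can be illustrated on small finite transition systems. The following is a minimal sketch, not the HOL4 formalization: the dictionary encoding, the `~` prefix standing in for the overbar of a dual action, and the helper names `dual`, `channel`, and `compose` are all assumptions made for this example.

```python
TAU = "tau"


def dual(label):
    """Dual of a label; a leading '~' stands in for the overbar."""
    return label[1:] if label.startswith("~") else "~" + label


def channel(label):
    """Channel name of a label, e.g. 'wt' for both 'wt SC 1' and '~wt SC 1'."""
    return label.lstrip("~").split()[0]


def compose(driver, hardware, start, hidden):
    """Reachable part of (d | s) \\ hidden as {state: [(label, next_state), ...]}."""
    result, todo = {}, [start]
    while todo:
        state = todo.pop()
        if state in result:
            continue
        d, s = state
        steps = []
        # A component may move alone only on channels that are not restricted.
        for label, d2 in driver.get(d, []):
            if label == TAU or channel(label) not in hidden:
                steps.append((label, (d2, s)))
        for label, s2 in hardware.get(s, []):
            if label == TAU or channel(label) not in hidden:
                steps.append((label, (d, s2)))
        # An action meeting its dual becomes one internal tau step.
        for label, d2 in driver.get(d, []):
            for other, s2 in hardware.get(s, []):
                if label != TAU and other == dual(label):
                    steps.append((TAU, (d2, s2)))
        result[state] = steps
        todo.extend(nxt for _, nxt in steps)
    return result


# Toy driver and hardware fragments (illustrative, not the McSPI models).
driver = {"d0": [("call init", "d1")], "d1": [("wt SC 1", "d2")]}
hardware = {"s0": [("~wt SC 1", "s1")]}
system = compose(driver, hardware, ("d0", "s0"), hidden={"wt", "rd"})
```

In this toy system the register write is no longer visible from outside: the only way the hardware fragment can take its `~wt SC 1` step is by synchronizing with the driver, which yields a single internal step.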

#### IV. SPI HARDWARE MODEL

The state of the SPI hardware is represented by a tuple $s = (regs, sreg, c)$. Here, $regs$ is a function mapping addresses of memory-mapped registers to words, and $sreg$ represents the internal hardware-controlled shift register for data transmission and reception. The component $c$ captures the control state of the device and is used to track the progress of its four functionalities: initialization, transmission, reception, and full-duplex synchronous transfer.

With the exception of register RX0, register reads are side-effect free and simply communicate the current value of the register: i.e., for every state $s$, $s \xrightarrow{rd\ a\ s.regs(a)} s$. Transitions that model register writes (i.e., $s \xrightarrow{wt\ a\ v} s'$) have side effects and are modeled by early instantiating all possible received values. Since many register updates are not atomic and require some time to take effect (e.g., writing into the transmission register does not automatically transfer the byte on the wire), transitions $s \xrightarrow{wt\ a\ v} s'$ are usually followed by a silent transition $s' \xrightarrow{\tau} s''$, which is the system internal transition that applies the visible side effects.

Fig. 4. SPI hardware initialization automaton

A special error state ⊥ is entered under the following conditions:


The behavior of transitions that have side effects can be represented by an automaton, which is split into four sub-automata for the four device functionalities.

*1) Initialization:* Figure 4 shows the hardware initialization automaton, where the black, red, and blue annotations describe the label, enabling conditions, and side effects of transitions, respectively. Note that we have omitted all transitions that lead to ⊥ in Figure 4; this also applies to the following figures. The initialization is activated when the value 1 is written to the SRST (software reset) bit of the SC (system configuration) register. The τ transition exiting state reset models the hardware completion of the reset operation and sets the SS (system status) register to 1. This register can be used by a driver to detect when the reset process is finished. In state setregs, the device awaits the setup of the hardware configuration, which is achieved by writing the SC, MC, and CCF registers. This step is necessary before starting data transmissions because the SPI hardware needs basic parameters, like the CP bit of the MC register and the WL bits of the CCF register. If one of these register updates sets a value that does not conform with the specification (e.g., the value of the WL bits should be no less than 3), then the model enters the state ⊥. Once all required registers have been written, the model enters the ready state rdy. Now the SPI can be utilized for data transmissions or be reinitialized.

*2) Synchronous transfer:* Figure 5 depicts the synchronous transfer sub-automaton. From the ready state, the synchronous transfer is activated when the TRM bits of the CCF register are set to 0. Then, updating CCT with 1 activates the state xfer enb and clears the TXS bit. The following silent transition makes the side effect of enabling the channel visible: the registers TX0 and RX0 are cleared, and the TXS and RXS bits are set to 1 and 0 respectively. From xfer rdy, once the message v to transmit is written to TX0, the TXS bit is cleared. The following silent transition transfers the data from the TX0 register to the shift register and the TXS bit is set internally. The device will now synchronize with an external SPI device, simultaneously transmitting the shift register and receiving one byte v, which is copied into the shift register. The following silent transition makes the communication visible to the driver, by copying the shift register to RX0 and setting the RXS bit. Finally, from the state data rdy, the received data can be fetched by reading the RX0 register. This also resets the RXS bit. The transmission process is repeated until the channel is disabled by writing 0 to the CCT register in the state xfer rdy and then resetting the CCF register to its original value.

Fig. 5. SPI hardware synchronous transfer automaton

As mentioned before, from the diagram in Figure 5, we have omitted all transitions that lead to ⊥. This happens, for instance, if TX0 is written before the TXS bit is set or when the model is in the state data rdy, or if RX0 is read while RXS is not set.
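The core loop of the synchronous transfer sub-automaton can be sketched as a small state machine. This is an illustrative simplification under stated assumptions, not the HOL4 model: the class name `McSpiXferSketch` and the intermediate state names `tx_written` and `exchange` are hypothetical, the channel enabling and disabling steps (CCF and CCT writes) are left out, and every TX0 write outside `xfer_rdy` is treated as an error, which is stricter than the conditions listed in the text.

```python
class McSpiXferSketch:
    """Toy model of the transfer loop; state names are partly assumed."""

    def __init__(self):
        # Start in the channel-enabled state: TXS = 1, RXS = 0.
        self.state = "xfer_rdy"
        self.txs, self.rxs = 1, 0
        self.tx0 = self.rx0 = self.sreg = 0

    def _fail(self):
        self.state = "error"  # stands for the error state of the model

    def write_tx0(self, value):
        """Driver writes TX0; only legal in xfer_rdy while TXS is set."""
        if self.state == "xfer_rdy" and self.txs == 1:
            self.tx0, self.txs, self.state = value, 0, "tx_written"
        else:
            self._fail()

    def tau(self):
        """Internal hardware step that makes a side effect visible."""
        if self.state == "tx_written":   # TX0 -> shift register, TXS set
            self.sreg, self.txs, self.state = self.tx0, 1, "exchange"
        elif self.state == "update":     # shift register -> RX0, RXS set
            self.rx0, self.rxs, self.state = self.sreg, 1, "data_rdy"

    def xfer(self, incoming):
        """Wire synchronization: send the shift register, receive `incoming`."""
        if self.state != "exchange":
            return None                  # transition not enabled
        outgoing, self.sreg, self.state = self.sreg, incoming, "update"
        return outgoing

    def read_rx0(self):
        """Driver reads RX0; only legal in data_rdy while RXS is set."""
        if self.state == "data_rdy" and self.rxs == 1:
            self.rxs, self.state = 0, "xfer_rdy"
            return self.rx0
        self._fail()
        return None


hw = McSpiXferSketch()
hw.write_tx0(0xA5)
hw.tau()
sent = hw.xfer(0x3C)      # the peer supplies 0x3C on the wire
hw.tau()
received = hw.read_rx0()
```

One round of the loop sends the byte written to TX0 and delivers the peer's byte through RX0, after which the model is back in `xfer_rdy`. Reading RX0 before RXS is set, or writing TX0 twice in a row, moves the sketch to its error state.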

*3) Transmission and reception:* The structure of the half-duplex automata for transmission and reception is similar to the synchronous transfer automaton. However, there are some notable differences:


Fig. 6. Driver initialization automaton

the TX0 register should not be used. A correct driver should wait for the hardware until the received data is ready through reading the RXS bit. The TXS and EOT bits are not applied in the reception automaton.

#### V. SPI DRIVER MODEL

The driver model is a direct manual translation of the real SPI driver binary and interacts with the hardware model using operations on the device registers. The model exposes all accesses to memory-mapped registers that are performed by the actual driver.

The driver state is a tuple $d = (b_1, b_2, idx, last\ read\ v, c)$. Here, $b_1$ is the transmit buffer and $b_2$ the receive buffer. The variable $idx$ points to the next byte in $b_1$ to be transmitted. The byte $last\ read\ v$ is the last returned value from the hardware, used for the driver's internal operations. The last component $c$ is the driver's control state. We define sub-automata corresponding to each of the four device functionalities.

*1) Driver initialization:* Figure 6 shows the driver initialization automaton. The automaton is invoked by an external call to the driver initialization function, represented here by the action call init. In state init, the automaton writes the SC register to reset the hardware. Then the automaton reads the SS register and updates $d.last\ read\ v$ with the returned value. In the state check stat, the automaton checks the fetched value to determine if the hardware finished the reset process. If the value is 1, the automaton enters the state setting<sub>1</sub>, otherwise it returns to the previous state and repeats this loop. Finally, the automaton enters the ready state by setting several registers in order (SC, MC, and CCF), indicating that the driver model is prepared to process function calls for data transmissions and reinitialization.

*2) Driver synchronous transfer:* The driver synchronous transfer automaton is shown in Figure 7. With the driver in state rdy, the automaton is invoked by action call xfer with a buffer $b_1$ copied to the driver's internal output buffer ($d.b_1$). Before starting data transmission, the automaton first prepares the necessary settings for the hardware by writing the CCF and CCT registers. Notice that CCF is read prior to writing in order to maintain other channel configurations (e.g., transmission speed). At this point, the automaton loops reading the CST register and checking the TXS bit, as long as the value of TXS is 0. Once the value 1 is read, the automaton enters the state write data. The following step writes the TX0 register with one byte of data that is sent to the external device, leading to the state read rxs. Hereafter, the automaton repeatedly reads the CST register as before but checks the RXS bit rather than the TXS bit, which indicates that the hardware transmission is finished and the received data is available in the RX0 register. If the RXS bit is 1, then the automaton in the state read rx0 issues a read request to the RX0 register. Next, the automaton can fetch the received data and check if all bytes in the output buffer are transmitted. If there are more bytes to transmit, the automaton returns to the state read txs and repeats the process. Otherwise, the automaton resets the CCT register and the CCF register to their initial values. Finally, the driver returns the received data ($d.b_2$) to the program that invoked it by using the label reply and returns to the ready state.

Fig. 7. Driver synchronous transfer automaton

The driver's transmission and reception automata are similar and left out.
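The polling discipline of the driver synchronous transfer automaton can be sketched as follows. This is a hedged illustration rather than the actual driver: `SpiStub` is a hypothetical register-level stand-in whose peer answers each byte with its bitwise complement, `read_cst` returning a `(TXS, RXS)` pair is an assumed interface, and the configuration writes to CCF and CCT are omitted.

```python
class SpiStub:
    """Hypothetical hardware stub; status bits become 1 after a few polls."""

    def __init__(self, polls_until_ready=2):
        self.polls_until_ready = polls_until_ready
        self.countdown = polls_until_ready
        self.phase = "tx"   # "tx": waiting for a TX0 write, "rx": byte in flight
        self.rx0 = 0

    def read_cst(self):
        """Return the status bits as a pair (TXS, RXS)."""
        if self.countdown > 0:
            self.countdown -= 1
            return (0, 0)
        return (1, 0) if self.phase == "tx" else (0, 1)

    def write_tx0(self, byte):
        assert self.phase == "tx" and self.countdown == 0, "TX0 written while TXS is 0"
        self.rx0 = byte ^ 0xFF  # assumed peer behaviour: reply with the complement
        self.phase, self.countdown = "rx", self.polls_until_ready

    def read_rx0(self):
        assert self.phase == "rx" and self.countdown == 0, "RX0 read while RXS is 0"
        self.phase, self.countdown = "tx", self.polls_until_ready
        return self.rx0


def driver_xfer(hw, out_buf):
    """Exchange out_buf byte by byte, polling CST before each register access."""
    in_buf = []
    for byte in out_buf:               # idx walks over the output buffer
        while hw.read_cst()[0] == 0:   # poll TXS (state read txs)
            pass
        hw.write_tx0(byte)             # state write data
        while hw.read_cst()[1] == 0:   # poll RXS (state read rxs)
            pass
        in_buf.append(hw.read_rx0())   # state read rx0
    return in_buf                      # handed back with the reply label


reply = driver_xfer(SpiStub(), [0x00, 0x0F, 0xFF])
```

The assertions inside the stub play the role of the omitted error transitions: a driver that skipped either polling loop would trip them, whereas the loop above never does.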

#### VI. ABSTRACT SPI SUBSYSTEM SPECIFICATION

In this section, we present an abstract specification of the combined device and driver subsystem. The model has the same interface as the concrete SPI subsystem (see Figure 3 (b)) and describes the visible effects of the four functionalities (i.e., initialization, full-duplex synchronous transfer, transmission, and reception) while ignoring all internal states of the SPI hardware and the memory-mapped device registers. The state of the abstract model is a pair, $a = (t, c)$. The component $t = (b_1, b_2, idx, v)$ is the data state, which contains the output and input buffers $b_1$ and $b_2$, the index of the next byte to be transmitted $idx$, and the received byte $v$. The component $c$ is the control state of the abstract model.

The abstract initialization and synchronous transfer automata in Figure 8 are largely self-explanatory. The control structure is the obvious one, with bytes in the transmit buffer $a.t.b_1$ being sent one by one and received bytes getting stored in $a.t.b_2$. Note also that, once in the ready state, reinitialization must remain enabled.

#### VII. REFINEMENT

The refinement is established by exhibiting a weak bisimulation [19]. This approach is useful to allow multiple levels of concretizations and abstractions through transitivity and compositionality (under parallel) of the corresponding equivalence.

Fig. 8. Abstract initialization and synchronous transfer automata

Below we use $p \xrightarrow{\tau^{*}(a)}_{1} p'$ to indicate an arbitrary number of τ transitions, optionally followed by an $a$ transition.

Definition VII.1 (Weak bisimulation). *Given two transition systems* $(S, \to_1)$ *and* $(T, \to_2)$*, a binary relation* $R \subset S \times T$ *is a weak simulation if for every* $(p, q) \in R$*:*


*The relation* $R$ *is a weak bisimulation if both* $R$ *and* $R^{-1}$ *are weak simulations. In the following, we write* $S \sim_R T$ *when* $R$ *is a weak bisimulation, and* $S \sim T$ *if there exists* $R$ *such that* $S \sim_R T$*.*

Our weak bisimulation definition is slightly different from the standard definition that allows arbitrary τ transitions after the observation $a$ (e.g., $q \xrightarrow{\tau^{*} a \tau^{*}}_{2} q'$). It is easy to show that our definition entails the standard one.

Weak bisimulation is transitive and compositional:

Theorem VII.1. *If* $S \sim_{R_1} T$ *and* $T \sim_{R_2} U$ *then* $S \sim_{R_1 \circ R_2} U$*, where* $p\ (R_1 \circ R_2)\ q \Leftrightarrow \exists r.\ p\ R_1\ r \wedge r\ R_2\ q$*.*

Theorem VII.2. *If* $S \sim_R T$ *then* $S \mid U \sim_{R'} T \mid U$*, where* $(p \mid r)\ R'\ (q \mid r) \Leftrightarrow p\ R\ q$*.*
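On finite transition systems, the weak simulation conditions and the relation composition of Theorem VII.1 can be checked mechanically. The sketch below is an illustration under stated assumptions: it takes the simulation clause to be "every step $p \xrightarrow{a} p'$ is matched by some $q \xrightarrow{\tau^{*}(a)} q'$ with $(p', q') \in R$", which follows the notation introduced above, and the dictionary encoding and function names are chosen for this example only.

```python
TAU = "tau"


def tau_closure(lts, state):
    """All states reachable from `state` by zero or more tau steps."""
    seen, todo = {state}, [state]
    while todo:
        current = todo.pop()
        for label, nxt in lts.get(current, []):
            if label == TAU and nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def weak_successors(lts, state, label):
    """Targets of tau* followed by `label` (tau* alone when label is tau)."""
    reach = tau_closure(lts, state)
    if label == TAU:
        return reach
    return {nxt for s in reach for l, nxt in lts.get(s, []) if l == label}


def is_weak_simulation(rel, lts1, lts2):
    """Every step of lts1 from a related state is weakly matched in lts2."""
    return all(
        any((p2, q2) in rel for q2 in weak_successors(lts2, q, label))
        for p, q in rel
        for label, p2 in lts1.get(p, [])
    )


def is_weak_bisimulation(rel, lts1, lts2):
    inverse = {(q, p) for p, q in rel}
    return is_weak_simulation(rel, lts1, lts2) and is_weak_simulation(inverse, lts2, lts1)


def compose_relations(r1, r2):
    """Relational composition used in Theorem VII.1."""
    return {(p, q) for p, mid1 in r1 for mid2, q in r2 if mid1 == mid2}


# Concrete system with an extra internal step, and two abstractions of it.
concrete = {"c0": [(TAU, "c1")], "c1": [("a", "c2")]}
middle = {"m0": [("a", "m1")]}
abstract = {"u0": [("a", "u1")]}
r1 = {("c0", "m0"), ("c1", "m0"), ("c2", "m1")}
r2 = {("m0", "u0"), ("m1", "u1")}
r3 = compose_relations(r1, r2)
```

Here `r1` and `r2` each pass `is_weak_bisimulation`, and so does their composition `r3` between `concrete` and `abstract`, mirroring the two-step construction through an intermediate model. Dropping the pair `("c1", "m0")` from `r1` makes the check fail, because the internal step of the concrete system is then left unmatched.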

#### *A. An intermediate model*

In order to show a weak bisimulation between the SPI subsystem and the abstract model A, we introduce an intermediate model B. The intermediate model, still abstracting from memory operations, has the states $b = (t, sreg, c)$ with the control state $c$ as in the abstract model, and with $t$ of the shape $(b_1, b_2, idx)$, i.e., as $t$ in the abstract model, but not including the received byte $v$, which is instead represented in an explicit shift register $sreg$, as in the SPI hardware model. Figure 9 shows on the top the full-duplex synchronous transfer automaton of the B model, and on the bottom demonstrates in part the weakly bisimilar control states in blue of the SPI subsystem under a relation $R_1$. For example, the control state update of the B model is weakly bisimilar with two states of the SPI subsystem, (check rxs|update) and (read rxs|update) (driver and hardware's control states respectively). The control state (check rxs|update) is reached from (read rxs|update) by reading the CST register, which is omitted in the B model. The τ transitions between two control states that are weakly bisimilar with the same abstract state are also ignored. In our example, if the RXS bit is 0 when the SPI hardware is in the control state update, the driver will return to the previous state by internally checking the fetched value. This stepwise approach makes it much easier to build the desired bisimulation relation.

Fig. 9. Model B synchronous transfer automaton and part weak bisimulation

#### *B. Weak bisimilarity of the abstract and SPI models*

The following two lemmas show the weak bisimilarity of B and SPI models, A and B models respectively.

*1) Weak bisimilarity of the intermediate and SPI models:* We define a relation $R_1$ for the B and SPI models, which matches their control states as indicated in Figure 9 and requires the equivalence of data buffers and records, shift registers, etc. In addition, the relation $R_1$ requires that if $b$ is not in the error state then neither are the driver and hardware models, and vice versa.

Lemma VII.1. $(d \mid s) \setminus \{\Delta_{wr} \cup \Delta_{rd}\} \sim_{R_1} b$

*Proof:* The two models have the same four functionalities, and the state transitions of the two models can be divided into the corresponding four sub-automata. We comment on the full-duplex synchronous transfer automaton, since the transmission and reception are similar and the initialization is straightforward. There are four kinds of transitions in this automaton for both models: call xfer *buf*, xfer $v\ v'$, τ, and reply *buf*′.

	- 1) The driver should delay writing the TX0 register until the TXS bit is 1, because the value 0 of the TXS bit means the TX0 register is not ready to be written. This also means the driver should not immediately write the next byte after legally writing the TX0 register.

Fig. 10. Weak bisimulation example of the A and B models


*2) Weak bisimilarity of the abstract and intermediate models:* The relation $R_2$ is defined in a similar way for the abstract and intermediate models. Figure 10 shows the relation for a part of the synchronous transfer automata of the two models, where weakly bisimilar control states are coloured identically. This relation basically matches control states under the requirement that buffers and records remain unchanged. The bisimulation condition forces input and output data of the two models to be the same.

Lemma VII.2. $b \sim_{R_2} a$

*Proof:* Same methodology as for Lemma VII.1.

From Theorem VII.1, Lemma VII.1 and Lemma VII.2, it directly follows that there is a relation $R_3$ for the abstract and SPI models:

Theorem VII.3. $(d \mid s) \setminus \{\Delta_{wr} \cup \Delta_{rd}\} \sim_{R_3} a$ *where* $R_3 = R_1 \circ R_2$

#### VIII. SYSTEM PROPERTIES

In order to demonstrate the functional properties of the system, we verify three theorems for the abstract model. These theorems transfer easily to the concrete models using the bisimulation results of Section VII. Additionally, we show that the abstract (SPI subsystem) model never enters the error state.

The functional correctness of full-duplex synchronous transfer should show that buffers are exchanged correctly between two devices. To show this property, we define the process $G(a_0, a_1) = (a_0 \mid (a_1\{\overline{xfer}\ v\ v' / xfer\ v'\ v\})) \setminus \Delta_{dev}$, which composes the abstract model of an SPI subsystem with a "dual" paired device: if one controller device uses $xfer\ v\ v'$ to transmit and receive data, the peripheral device uses the dual label to synchronize. Figure 11 depicts the composition of two devices.

Fig. 11. Composition of two devices

Theorem VIII.1 shows the functional correctness of the full-duplex synchronous transfer. Notice that buffers must have the same length, otherwise the larger buffer cannot be transmitted in its entirety.

Theorem VIII.1. *If* $0 < |b_0| = |b_1|$, $(t_0, rdy) \xrightarrow{call\ xfer\ b_0} a_0$*, and* $(t_1, rdy) \xrightarrow{call\ xfer\ b_1} a_1$*, then* $\exists n\ a'_0\ a'_1\ a''_0\ a''_1.\ G(a_0, a_1) \left(\xrightarrow{\tau}\right)^{n} G(a'_0, a'_1) \wedge a'_0 \xrightarrow{reply\ b_1} a''_0 \wedge a'_1 \xrightarrow{reply\ b_0} a''_1$

*Proof:* We show that the first byte can be exchanged correctly and then complete the proof by induction.
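The statement of Theorem VIII.1 can be illustrated by a direct simulation of two paired abstract devices. This is a sketch under simplifying assumptions, not the proof: the function name `exchange` is hypothetical, each synchronization on an xfer label is counted as one internal step of the composition, and any further internal steps of the abstract automata are ignored.

```python
def exchange(buf0, buf1):
    """Simulate the paired devices; return both replies and the number of sync steps."""
    if not (0 < len(buf0) == len(buf1)):
        raise ValueError("Theorem VIII.1 assumes non-empty buffers of equal length")
    recv0, recv1 = [], []
    sync_steps = 0
    for idx in range(len(buf0)):
        v, v_prime = buf0[idx], buf1[idx]
        # One synchronization: device 0 sends v and receives v_prime,
        # while device 1 does the opposite through the dual label.
        recv0.append(v_prime)
        recv1.append(v)
        sync_steps += 1
    return recv0, recv1, sync_steps


reply0, reply1, steps = exchange([1, 2, 3], [7, 8, 9])
```

After as many synchronizations as there are bytes, each device replies with the other device's buffer, which is the shape of the conclusion of the theorem. The loop over `idx` corresponds to the induction in the proof, with the first iteration as the base case.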

An analogous theorem shows the correctness of transmission/reception. In this case, l, the number of bytes to be received, should be greater than or equal to the length of the data buffer b0, otherwise extra data of the buffer will be lost.

Theorem VIII.2. *If* $0 < |b_0| \le l$*,* $(t_0, rdy) \xrightarrow{call\ tx\ b_0} a_0$*, and* $(t_1, rdy) \xrightarrow{call\ rx\ l} a_1$*, then* $\exists n\ a'_0\ a'_1\ a''_1.\ G(a_0, a_1) \left(\xrightarrow{\tau}\right)^{n} G(a'_0, a'_1) \wedge a'_1 \xrightarrow{reply\ b_0} a''_1$

Finally, we show that the abstract model can never enter an erroneous state. The bisimulation transfers this property to the SPI hardware and the driver:

Theorem VIII.3. *If* $c \neq \bot$ *and* $(t, c) \to (t', c')$*, then* $c' \neq \bot$

#### IX. INFORMATION FLOW SECURITY

Formal device and driver verification projects have generally focused on functional correctness [6]–[10]. However, the device driver can possibly leak sensitive information and therefore, for critical applications, information flow analysis is needed. One of the main benefits of establishing weak bisimulation instead of a simulation is that the former guarantees that two systems have the same information flows (up to channels that are not modeled here, like timing). We show that weak bisimilarity is sufficient to capture progress-sensitive noninterference (PSNI), in the sense of Hedin and Sabelfeld [11]. Let $E$ be the set of transition labels of the system under consideration. In our case, we may consider a system as in Figure 11 with $E = \Delta_{dr} \cup \Delta'_{dr}$, where $\Delta_{dr}$ and $\Delta'_{dr}$ are distinct driver interfaces that are both high, since the interfaces are used to communicate sensitive data. We assume a context $C$ that is allowed to interact with the system using any label in $E$. This context is additionally equipped with a public, distinguished interface of labels $P$ that the context can use to receive and produce publicly observable stimuli. Then, any observations using labels in $P$ that can cause the abstract and concrete models to be distinguished must be due to $C$ being able to bring the two systems to states that $C$ can distinguish. Of course, if the two systems are weakly bisimilar, this is in fact not possible, motivating the following definition.

Definition IX.1 (Contextual indistinguishability). *Two states* $s_1$ *and* $s_2$ *are contextually indistinguishable,* $s_1 \approx s_2$*, if for every context* $C$*,* $(s_1 \mid C) \setminus E \sim (s_2 \mid C) \setminus E$*.*

We use the term contextual indistinguishability instead of contextual equivalence, as the former considers only contexts of very specific shapes. It is not the case that contextual indistinguishability implies contextual equivalence in general, as the latter is a congruence, specifically under CCS sum, which the former is not. However, weak bisimulation *is* a congruence under parallel composition and restriction. Thus, if $s_1$ and $s_2$ are weakly bisimilar, then they are also contextually indistinguishable. The converse implication, of course, does not hold. It also follows directly that ≈ is transitive.

The concept of contextual indistinguishability is related to Focardi et al.'s nondeducibility of composition (NDC) [20], which in our setting would be the condition $(s \mid C) \setminus H \sim s \setminus H$ on $s$, where $H$ represents the high labels and $C$ is restricted to interact using only $H$. However, it is not clear how to adapt the NDC condition to our refinement-based setting, and also, in contrast to contextual indistinguishability, the NDC condition is not able to accommodate systems such as ours that obtain low observability only through the use of the context.

For the definition of PSNI, a *run* π is any sequence of transitions starting from an initial state. Such a run is *complete* if it cannot be extended, i.e., it is either unbounded or ends in a final state. For a run π, we let $O(\pi)$ be the list of public labels in π. We can now define PSNI adapted to our setting of reactive systems as follows:

Definition IX.2 (PSNI). *Two states* $s_1$ *and* $s_2$ *are PSNI, if for every complete run* $\pi_1$ *starting from* $s_1$*, there exists a complete run* $\pi_2$ *starting from* $s_2$ *such that* $O(\pi_1) = O(\pi_2)$*, and vice versa.*

The definition can be seen to be equivalent to the one in [11], or in terms of termination only, with the notion of weakly termination-sensitive noninterference of [21]<sup>1</sup>.
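For finite systems without cycles, Definition IX.2 reduces to comparing two sets of public observation sequences. The sketch below makes that restriction explicit: it assumes an acyclic transition graph, so every complete run ends in a state without outgoing transitions and unbounded runs do not occur. The function names and the example systems are chosen for illustration only.

```python
def observations(lts, state, public):
    """Set of public-label sequences O(pi) over all complete runs from `state`.

    Assumes a finite, acyclic transition graph {state: [(label, next_state)]}.
    """
    steps = lts.get(state, [])
    if not steps:
        return {()}  # the run cannot be extended, so it is complete
    result = set()
    for label, nxt in steps:
        prefix = (label,) if label in public else ()
        result |= {prefix + rest for rest in observations(lts, nxt, public)}
    return result


def psni(lts1, s1, lts2, s2, public):
    """Both directions of Definition IX.2 hold iff the observation sets agree."""
    return observations(lts1, s1, public) == observations(lts2, s2, public)


PUBLIC = {"out"}
spec = {"t0": [("out", "t1")]}
# Handles a secret internally, then always produces the public output.
secure = {"s0": [("secret", "s1")], "s1": [("out", "s2")]}
# May get stuck after reading the secret, so the output can fail to appear.
leaky = {"l0": [("secret", "l1"), ("secret", "l3")], "l1": [("out", "l2")]}
```

With these definitions, `secure` and `spec` are PSNI, while `leaky` and `spec` are not: the complete run of `leaky` that stops without producing `out` has no counterpart in `spec`. This is the progress-sensitive part of the property, since the difference lies only in whether the public output eventually occurs.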

Contextual indistinguishability is a sufficient condition for PSNI, because it guarantees the existence of traces for two transition systems with the same observable labels.

Theorem IX.1. *If* $s \approx t$*, then* $s$ *and* $t$ *are PSNI.*

If $s$ and $t$ are not PSNI, then we find a complete run $\pi_1$ from $s$ such that all complete runs $\pi_2$ starting from $t$ have different low observations from $\pi_1$. Clearly, this allows a context $c$ using labels in $L \cup H$ to steer $s$, possibly nondeterministically, into a state $s'$ that cannot be matched by $t$, in the sense of weak bisimilarity. Here $L$ represents low labels.

<sup>1</sup> In fact, at our low level of modelling, with weak bisimulation, the adversary does not have any model-external means (such as exhausting the memory) at its disposal to prevent progress. Hence our account is also strongly termination-sensitive in the terminology of [21].

Fig. 12. Information flow security example

We can also show that PSNI transfers under ≈:

Theorem IX.2. *Suppose* s ≈ s′ *and* s′ *and* t *are PSNI. Then* s *and* t *are PSNI.*

We cannot in general replace weak bisimulation by the corresponding notion of simulation in the definition of contextual indistinguishability. A device driver may leak a sensitive boolean s by either terminating execution conditionally on s or by entering a diverging loop (e.g., while (s) {}), but still be (weakly) simulated by the abstract model. In this case, an external attacker may discover the value of the secret boolean by observing the impossibility of transmission of a buffer.

Also, establishing bisimulation allows us to compose the system safely with non-deterministic components. For instance, we can introduce a faulty communication medium (MED) between two devices that can indeterminately deliver wrong values. Figure 12 (A) represents the abstract model where two abstract devices (our A model) are connected through the given medium. As a result of the medium, the final output of the abstract model is non-deterministically v or v′. The compositionality of weak bisimulation guarantees that in the system where the two concrete SPI subsystems are interconnected by the same medium (see Figure 12 (B)), the final output is also non-deterministically v or v′: the system has the same information flows. On the other hand, the system in Figure 12 (C), where the receiving device driver decides the value according to a secret value, leaks the secret via the final output. This model cannot be validated using contextual indistinguishability, but it can be when weak bisimulation is replaced by a corresponding notion of weak simulation.

# X. APPLICATION: SECURING A RANDOM NUMBER GENERATOR USING SPI

As a demonstrating application, we developed a secure random number generator (RNG) that relies on the SPI hardware for sourcing entropy. The architecture of the system is depicted in Figure 1. The blue components are the software components not including the SPI driver(s). The SPI driver interacts with the SPI hardware through operations on memory-mapped registers (∆rd and ∆wr). We use a BeagleBone Black to connect with an Arducam Shield Mini 2MP Plus camera through SPI. The RNG captures images of the floating material in a lava lamp. This has been shown to be a good source of physical randomness [22], [23].

In order to prevent vulnerabilities of other software from affecting the RNG, we develop a bare-metal application that integrates the SPI driver and that is executed on top of the Prosper hypervisor [14]. This is a hypervisor for ARMv7-A processors that provides provable separation between different guests and can be configured to grant access to the SPI registers only to a dedicated partition running our driver. This allows an untrusted partitioned Linux guest (such as, in our case, the Verificatum e-voting application [15]) to harden the built-in Linux RNG with physical randomness through a hypercall interface provided by the hypervisor, with strong end-to-end security guarantees. In this scenario, the SPI subsystem plays an important role. In addition to failing to function, a faulty device driver may reduce the entropy of the system by simply returning predictable buffers, or it could communicate, directly or indirectly, internal data to the external device. Formal verification of the driver model allows us to rule out these problems. Moreover, it helped to identify redundant operations of the driver. For example, the initial version (extracted from the u-boot library) sets up the WL bits of the CCF register whenever the transmission functions are used; however, it is enough to set them once in the initialization function.

In order to guarantee the absence of vulnerabilities at the code level, the refinement should be pushed down to the binary code of the device driver. We extract the driver model by manual inspection of the driver binary. This step has yet to be formalized. We don't view this as a major weakness, however, given that the memory-mapped registers use uncached memory only. We have experimented with the usage of the binary analysis tool HolBA [24] for verifying weak bisimilarity of the driver's assembly code and the driver model. The weak bisimulation relates fragments of binary instructions (i.e., program counter addresses) to a state of the driver's automaton. Each fragment has a single entry point, and either (1) consists of one single instruction accessing a device register or (2) does not access the device. In the former case, the instruction directly corresponds to a transition of the driver model. In the latter case, the fragment corresponds to a finite sequence of silent transitions. We then translate the relation into pre/post conditions for the fragments, which can be analyzed via the HolBA weakest-precondition tool and a Satisfiability Modulo Theories (SMT) solver.

#### XI. RELATED WORK

Some previous work has applied the bisimulation methodology for verification in a theorem prover context [25], [26]. For example, Röckl et al. [25] verified the correctness of several communication protocols by proving weak bisimilarity. We prove the equivalence of the abstract and SPI models using the same approach.

Several projects on formal verification of low-level software have focused on the operating system (OS), like seL4 [27] and CertiKOS [28]. However, the functional correctness of device drivers usually is not considered. For example, the seL4 microkernel [27] only guarantees the isolation of device drivers located in the user space, where the correctness of drivers is ignored. CertiKOS [28] initially did not verify the drivers either. Based on CertiKOS, Chen et al. [10] developed a verified interruptible operating system with device drivers. They proposed a general device model with several instantiations and a realistic formal model of device interrupts. Although their device model has similarities with the one presented here, there are notable differences:


Other previous work on verifying the functional correctness of device drivers studied various I/O devices, like UART [7], hard disk [8], and USB OHCI [6]. In that work, there is no abstract I/O device model to represent the general behaviours of different I/O devices, and the approaches are too restrictive to extend to other hardware devices. Duan et al. [9] proposed an abstract device model that is plugged into the formal model of the ARMv4 instruction set architecture and later extended it to support interrupts with respect to the ARMv7 architecture [29]. However, the device state is merged into the machine state in their model, which requires careful handling of the interleavings between the execution of the device and the processor. Because of this complexity, it is difficult to apply their model to verify I/O devices.

#### XII. CONCLUSION AND FUTURE WORK

We modeled and verified an SPI subsystem that consists of the device hardware and its driver. The verification establishes a weak bisimulation between this model and an abstract specification, which is used to transfer functional and information flow properties of the abstract model to the concrete one.

Our methodology can be reused to verify other SPI subsystems by establishing a refinement with the abstract model presented in this paper. There are some valuable lessons we have learned from this project:


In order to complete the binary verification of the device driver, we plan to follow the strategy of Section X, which establishes a bisimulation between the SPI driver model and its binary code using contract-based verification of the HolBA platform [24]. Moreover, we are planning to address two limitations of the current models: the absence of DMA and interrupts. While these can be encoded via explicit synchronizations processor/device-memory or processor-device, we think that explicit treatment of these features can simplify models and proofs [30]. Currently, our models are shallowly embedded in HOL4. This allows us to partially automate our proof via the HOL4 standard tactics. For example, large parts of the proof search are fully automated using METIS_TAC. Our work can give insight for deeply embedding the models in HOL4. This can provide a general framework for modeling multiple types of I/O devices and increase automation by implementing decision procedures for checking bisimilarity.

Finally, our information flow analysis does not deal properly with side channels. How to do this is an open challenge, even for uncached memory, as here. For instance, precisely modelling timing is infeasible for real systems since we do not have accurate timing information of the underlying hardware. A more successful strategy consists in defining abstract leakage models in the form of observations (e.g., accessed memory addresses affect caches that in turn affect the timing) and preventing timing side channels by proving observational equivalence. We are currently working on validating [31] such models and defining methodologies to handle different side channels at each refinement step [32].

#### REFERENCES


# Celestial: A Smart Contracts Verification Framework

Samvid Dharanikota\* *Microsoft Research India* Bangalore, India samvid.dharani@gmail.com

Suvam Mukherjee\* *Microsoft Corporation* Redmond, USA sumukherjee@microsoft.com

Chandrika Bhardwaj# *Goldman Sachs* Bangalore, India chandrika.bhardwaj@gs.com

Aseem Rastogi *Microsoft Research India* Bangalore, India aseemr@microsoft.com

Akash Lal *Microsoft Research India* Bangalore, India akashl@microsoft.com

*Abstract*—We present CELESTIAL, a framework for formally verifying smart contracts written in the Solidity language for the Ethereum blockchain. CELESTIAL allows programmers to write expressive functional specifications for their contracts. It translates the contracts and the specifications to F<sup>⋆</sup> to formally verify, against an F<sup>⋆</sup> model of the blockchain semantics, that the contracts meet their specifications. Once the verification succeeds, CELESTIAL performs an erasure of the specifications to generate Solidity code for execution on the Ethereum blockchain. We use CELESTIAL to verify several real-world smart contracts from different application domains. Our experience shows that CELESTIAL is a valuable tool for writing high-assurance smart contracts.

*Index Terms*—Smart contracts, Blockchain, Reliability, Testing

# I. INTRODUCTION

Smart contracts are programs that enforce agreements between parties transacting over a blockchain. To date, more than a million smart contracts have been deployed on the Ethereum blockchain with applications such as digital wallets, tokens, auctions, and games, holding digital assets worth over \$200 billion [19].

The most popular language for smart contract development is Solidity [20]. Solidity contracts are compiled to Ethereum Virtual Machine (EVM) bytecode for execution on the blockchain. Unfortunately, Solidity has obscure operational semantics understood only partially by most programmers. This often leaves vulnerabilities in the smart contracts. Repeated high-profile attacks (e.g. TheDAO [17] and ParityWallet [18] attacks) orchestrated around these vulnerabilities have resulted in financial losses running into millions of dollars. Worse, smart contracts are "burned" into the blockchain on deployment, which does not allow subsequent patches to fix the vulnerabilities. As a result, it is necessary to ensure correctness at the time of deployment.

Smart contracts are relatively small pieces of code with simple data structures [29]. All these qualities combined (their critical nature, immutability after deployment, and small size) make smart contracts a good fit for formal verification. The challenge, however, is to lower the formal verification entry barrier for smart contract developers.

#Work done during an internship at Microsoft Research India.

Towards that goal, we present CELESTIAL<sup>§</sup>, an open-source framework for developing formally verified smart contracts. CELESTIAL allows programmers to annotate their Solidity contracts with Hoare-style specifications [32] capturing functional correctness properties. The contracts and the specifications are translated to F<sup>⋆</sup> [45], which, in an *automated manner*, proves that the contracts meet their specifications. Once F<sup>⋆</sup> returns a verified verdict, CELESTIAL erases the specifications from the input contracts, and emits Solidity code that can be deployed and executed on the Ethereum blockchain. By using Solidity as the source language, and providing fully-automated verification, CELESTIAL ensures a low entry barrier for smart contract developers.

F<sup>⋆</sup> is a proof assistant and program verifier with a fully dependent type system. We find it suitable for smart contract verification for several reasons. First, it provides SMT-based automation which, as we show empirically, suffices for fully-automated verification of real-world smart contracts. Second, F<sup>⋆</sup> supports user-defined effects, allowing us to work in a custom state and exception effect [21] modeling the blockchain semantics. Finally, F<sup>⋆</sup> supports expressive higher-order specifications, though we use its first-order subset with quantifiers and arithmetic (adding our own libraries for arrays and maps).

We evaluate CELESTIAL by verifying several real-world Solidity smart contracts that together currently hold millions of dollars of financial assets. The contracts span different application domains including tokens, wallets, and a governance protocol for Azure Blockchain. We studied the contracts (and in some cases, discussed them with the developers) to design their specifications and formally verified that the contracts meet those specifications. In the process, we uncovered bugs in some cases (e.g. missing overflow checks), manifesting as F<sup>⋆</sup> verification failures. Once we fixed those bugs (e.g. by adding runtime checks), F<sup>⋆</sup> was able to successfully verify the contracts in all the cases. The overhead of any additional instrumentation, which was required for correctness, was at most 20% in terms of gas consumption.

§https://github.com/microsoft/verisol/tree/celestial/Celestial

\*Equal contribution

Fig. 1: Architecture of the CELESTIAL framework.

Fig. 2: A simple blockchain based e-commerce application.

Summarizing our main contributions:


#### II. OVERVIEW

The high-level architecture of CELESTIAL is outlined in Figure 1. A CELESTIAL project is a set of contracts (e.g. C1, C2, etc. in the figure) written in Solidity. These contracts may be annotated with functional specifications encoding properties of interest. CELESTIAL provides two kinds of translations for these contracts. The first one translates the contracts and their specifications to F<sup>⋆</sup> [45], a dependently-typed functional programming language designed for program verification. F<sup>⋆</sup>, using a model of the blockchain semantics (Section III), verifies that the contracts meet their specifications. A second translation simply erases all specifications to emit vanilla Solidity contracts. In this section, we use a simple application (Section II-A) to describe the specification language of CELESTIAL (Section II-B). We discuss the verification scope and limitations of the framework later in Section II-C.

#### *A.* SIMPLEMART

Consider a simple blockchain-based e-commerce application SIMPLEMART from Figure 2. The application contains a SimpleMarket contract (Listing 1) which interacts with one or more buyers and sellers that may either be smart contracts themselves or externally-owned accounts. A seller registers an item for sale by invoking the sell method of SimpleMarket, with the price as argument. In response, SimpleMarket creates an instance of the Item contract, which holds metadata about the new item available for sale. It also emits an event (eNewItem) informing the seller about the identity (in this case, the address) of the new item. A buyer may purchase an item by invoking the buy method of SimpleMarket, passing the item address as an argument, along with the ether amount matching the item price. If the item has not been sold already, SimpleMarket records the sale in its state, which involves adding the ether towards the total sales proceeds for the respective seller and marking the item as being sold. The seller may then withdraw the ether from SimpleMarket via the withdraw method.

```
 1 contract SimpleMarket {
 2   mapping (address => uint) sellerCredits;
 3   mapping (address => Item) itemsToSell;
 4   uint totalCredits;
 5   event eNewItem (address, address);
 6   event eItemSold (address, address);
 7
 8   function sell (uint price) public
 9     returns (address itemId) {
10     Item item = new Item (address(this), msg.sender, price);
11     itemId = address(item);
12     itemsToSell[address(item)] = item;
13     emit eNewItem (msg.sender, itemId);
14   }
15   function buy (address itemId) public payable
16     returns (address seller) {
17     Item item = itemsToSell[itemId];
18     if (item == null) { revert ("No such item"); }
19     if (msg.value != item.getPrice())
20       { revert ("Incorrect price"); }
21     seller = item.getSeller();
22     totalCredits = safe_add (totalCredits, msg.value);
23     sellerCredits[seller] =
24       sellerCredits[seller] + msg.value;
25     delete (itemsToSell[itemId]);
26     emit eItemSold (msg.sender, itemId);
27   }
28   function withdraw (uint amount) public {
29     if (sellerCredits[msg.sender] >= amount) {
30       msg.sender.transfer (amount);
31       sellerCredits[msg.sender] -= amount;
32       totalCredits -= amount;
33     } else { revert ("Insufficient balance"); }
34   }
35 }
```
Listing 1: The SimpleMarket Solidity contract

Functional correctness of the buy method requires that if a buyer initiates buy with a valid item and price, then the item is sold and the seller's sales proceeds are credited, leaving all other sellers' proceeds unchanged. In addition, we would also like to verify that the call does not result in arithmetic overflow of the seller's proceeds because this can result in honest sellers losing their credits.

#### *B. Specification Language*

Listing 2 shows excerpts of the CELESTIAL versions of Item and SimpleMarket contracts. The general form of a CELESTIAL contract is shown in Listing 3. The annotations are Hoare-style specifications, similar to those of languages like Dafny [36]. The specifications are written over the contract fields, function arguments, as well as implicit variables such as balance (the contract balance), value (ether value in a payable method), and log (the transaction event log, formally modeled as a list of events). Our specifications cover the full power of first-order reasoning with quantifiers, along with theories for arithmetic (both modular and non-modular), arrays and maps. We provide programmers the ability to write pure functions that can be invoked only from specifications, not Solidity methods, to enable code reuse. We now explain the individual elements of CELESTIAL specifications.

```
 1 contract Item {
 2   address seller; uint price; address market;
 3   function getSeller () returns (address s)
 4     modifies []
 5     post (s == seller)
 6   { return seller; }
 7   // other methods
 8 }
 9 contract SimpleMarketplace {
10   // contract fields
11   ...
12   invariant balanceAndSellerCredits {
13     balance == totalCredits &&
14     totalCredits >= sum_mapping (sellerCredits)
15   }
16   function buy (address itemId) public
17     returns (address seller)
18     modifies [sellerCredits, totalCredits, itemsToSell, log]
19     tx_reverts !(itemId in itemsToSell)
20       || msg.value != itemsToSell[itemId].price
21       || msg.value + totalCredits > uint_max
22     post (!(itemId in itemsToSell)
23       && sellerCredits == old(sellerCredits)[
24            seller => old(sellerCredits)[seller] + msg.value]
25       && log == (eItemSold, msg.sender, itemId) :: old(log))
26   { // implementation of the buy function }
27 }
```

```
contract A {
  uint x, y; // fields, as usual

  invariant { ϕ1 } // contract-level invariant

  function foo () public
    modifies [x]   // fields that are modified
    tx_reverts ϕ2  // revert condition (under-specified)
    pre ϕ3         // precondition
    post ϕ4        // postcondition
  { s }            // Solidity implementation
}
```
Listing 3: A representative CELESTIAL contract

*a) Contract invariant:* A contract invariant is a predicate on the state of the contract (i.e. its field values) that is expected to be valid at the boundaries of its public methods. When verifying a contract, the invariant is added to the pre- and postconditions of every public method. All contract fields in a CELESTIAL contract are necessarily private (see Section II-C). Additionally, CELESTIAL ensures that all its contracts are *external callback free* (Section IV) to disallow re-entrancy based attacks from external contracts. Hence, it is safe to assume the invariant at the beginning of public methods. Constructors are special; they only guarantee the invariant in their postcondition but do not assume it as a precondition. For example, the invariant on line 12 in Listing 2 specifies that the contract's balance equals or exceeds the total proceeds from sales that have not already been claimed by the respective sellers (sum_mapping is a library function for summing values in an int-valued map).
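The boundary discipline described above can be mimicked at run time. The following Python sketch is only an analogy: CELESTIAL proves the invariant statically through F<sup>⋆</sup>, whereas this code checks it dynamically. The `public` decorator, the `Market` class, and its methods are hypothetical names chosen for illustration.

```python
from functools import wraps


def public(method):
    """Check the contract invariant on entry to and exit from a public method."""
    @wraps(method)
    def wrapper(self, *args, **kwargs):
        if not self.invariant():
            raise AssertionError("invariant must hold on entry")
        result = method(self, *args, **kwargs)
        if not self.invariant():
            raise AssertionError("invariant must hold on exit")
        return result
    return wrapper


class Market:
    def __init__(self):
        # A constructor establishes the invariant but does not assume it.
        self.balance = 0
        self.total_credits = 0
        self.seller_credits = {}
        if not self.invariant():
            raise AssertionError("constructor must establish the invariant")

    def invariant(self) -> bool:
        # Mirrors the invariant of Listing 2 (lines 12-15).
        return (self.balance == self.total_credits
                and self.total_credits >= sum(self.seller_credits.values()))

    @public
    def credit(self, seller: str, value: int) -> None:
        self.balance += value
        self.total_credits += value
        self.seller_credits[seller] = self.seller_credits.get(seller, 0) + value

    @public
    def broken_credit(self, seller: str, value: int) -> None:
        # Forgets to update total_credits, so the exit check fails.
        self.balance += value
        self.seller_credits[seller] = self.seller_credits.get(seller, 0) + value
```

Calling `credit` preserves the invariant, while `broken_credit` is rejected at its exit boundary. A static verifier would reject the latter before deployment rather than at run time.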

*b) Field updates:* The modifies clause specifies contract fields that a method can update. The getSeller method in Item has an empty modifies clause (line 4 in Listing 2), which specifies that the function may read the state of the contract, but cannot make any updates.

*c) Pre- and postconditions:* Preconditions (pre) are properties that hold at the beginning of a method execution. Public methods must have a trivial precondition true because they can be invoked by the untrusted external world. Postconditions (post) are properties that hold when the method terminates successfully (without reverting). The postconditions may refer to field values at the beginning of the method using the old keyword. For example, the condition in line 23 in Listing 2 specifies that the final sellerCredits is the original sellerCredits map with only the seller key updated.

*d) Revert conditions:* tx_reverts under-specifies the conditions under which a method reverts, i.e. if tx_reverts holds at the beginning of a method, the method will definitely revert. For example, the buy function definitely reverts if the buyer invokes it with an item which is not available for sale, or the buyer provides ether which does not match the item price, or the totalCredits overflows. This is captured in the specification in line 19. Not specifying tx_reverts is equivalent to tx_reverts(false).

*e) Safe Arithmetic:* In Solidity, arithmetic operations may silently over- or underflow, whereas division by 0 results in reverts. CELESTIAL, when translating to F<sup>⋆</sup>, adds assertions before every arithmetic operation which check for no over- and underflows, and division by 0. The programmer must add specifications or runtime checks to allow the verifier to prove the safety of the arithmetic operations. CELESTIAL also provides a safe arithmetic library with built-in runtime checks (safe_add operation in line 22 of Listing 1).
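The difference between silently wrapping arithmetic and a checked operation such as safe_add can be sketched in Python. This is an illustrative model, not the CELESTIAL library: the `Revert` exception and the names `wrapping_add` and `safe_sub` are assumptions introduced here, and only `safe_add` corresponds to a name used in Listing 1.

```python
UINT_MAX = 2**256 - 1  # largest 256-bit unsigned integer


class Revert(Exception):
    """Models a reverted transaction."""


def wrapping_add(a: int, b: int) -> int:
    """Unchecked 256-bit addition: silently wraps around on overflow."""
    return (a + b) & UINT_MAX


def safe_add(a: int, b: int) -> int:
    """Checked addition: reverts instead of wrapping."""
    if a + b > UINT_MAX:
        raise Revert("addition overflow")
    return a + b


def safe_sub(a: int, b: int) -> int:
    """Checked subtraction: reverts instead of underflowing."""
    if b > a:
        raise Revert("subtraction underflow")
    return a - b
```

With the unchecked version, adding 1 to `UINT_MAX` yields 0, which is exactly the kind of silent corruption of balances that the inserted assertions are meant to rule out.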

To summarize, we have expressed the following properties of the buy method. The revert condition specifies that the method reverts when the item is not present or the ether sent by the buyer does not match the item price. The method also reverts when totalCredits overflows. Since an invariant of the contract is that totalCredits is greater than or equal to the sum of pending credits of all the sellers, when totalCredits does not overflow, individual seller credits also do not overflow. Finally, line 23 in Listing 2 specifies that only the item seller's credits are incremented by the price of the item, while credits for all other sellers remain the same.
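The properties summarized above can be phrased as executable checks. The following Python sketch models the buy method and tests it against the revert condition and postcondition of Listing 2 at run time; CELESTIAL instead proves them statically. The data layout and the helper `checked_buy` are hypothetical, and the log is kept newest-first to match `::` in the postcondition.

```python
import copy
from dataclasses import dataclass, field

UINT_MAX = 2**256 - 1


class Revert(Exception):
    """Models a reverted transaction."""


@dataclass
class Item:
    seller: str
    price: int


@dataclass
class Market:
    items_to_sell: dict = field(default_factory=dict)   # item id -> Item
    seller_credits: dict = field(default_factory=dict)  # seller -> credits
    total_credits: int = 0
    log: list = field(default_factory=list)             # newest event first


def tx_reverts(m: Market, item_id: str, value: int) -> bool:
    """Under-specified revert condition (lines 19-21 of Listing 2)."""
    return (item_id not in m.items_to_sell
            or value != m.items_to_sell[item_id].price
            or value + m.total_credits > UINT_MAX)


def buy(m: Market, sender: str, item_id: str, value: int) -> str:
    item = m.items_to_sell.get(item_id)
    if item is None:
        raise Revert("No such item")
    if value != item.price:
        raise Revert("Incorrect price")
    if m.total_credits + value > UINT_MAX:
        raise Revert("totalCredits overflow")
    m.total_credits += value
    m.seller_credits[item.seller] = m.seller_credits.get(item.seller, 0) + value
    del m.items_to_sell[item_id]
    m.log.insert(0, ("eItemSold", sender, item_id))
    return item.seller


def checked_buy(m: Market, sender: str, item_id: str, value: int) -> str:
    """Run buy and compare the outcome with the specification."""
    old = copy.deepcopy(m)
    must_revert = tx_reverts(old, item_id, value)
    seller = buy(m, sender, item_id, value)  # a Revert propagates to the caller
    # If tx_reverts held, the call was required to revert.
    assert not must_revert
    # Postcondition: only the seller's entry changes.
    expected = dict(old.seller_credits)
    expected[seller] = old.seller_credits.get(seller, 0) + value
    assert item_id not in m.items_to_sell
    assert m.seller_credits == expected
    assert m.log == [("eItemSold", sender, item_id)] + old.log
    return seller
```

Because tx_reverts is an under-specification, the check is one-directional: a successful call must not satisfy the revert condition, but a revert for some other reason is still permitted.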

#### *C. Verification Scope and Limitations*

*a) Threat model:* All contracts and user accounts that are not part of a CELESTIAL project P are treated as the *external world* for P. The external world is free to initiate arbitrary transactions by calling public methods of P with arbitrary arguments. The external world, however, cannot directly access the private fields and methods of P.

*b) Trusted Computing Base:* The TCB of CELESTIAL includes the CELESTIAL compiler consisting of the two syntax translations, the F<sup>⋆</sup> model of the blockchain (Section III), the F<sup>⋆</sup> toolchain itself, and the Solidity compiler (these components are colored blue in Figure 1). With these components in our TCB, formal verification of smart contracts in CELESTIAL guarantees that when the compiled Solidity contracts are run on the blockchain, they behave as per their specifications. We leave it as future work to minimize trust in our F<sup>⋆</sup> blockchain semantics (say, by testing it against a Solidity test suite).

*c) Solidity Language Restrictions:* CELESTIAL does not support delegatecall, which is used to call functions from other contracts in a way that the callee may directly change the state of the calling address, thereby breaking the function call abstraction. Since this is insecure (for example, the ParityWallet [18] attack exploited it), the secure development recommendations advise against its use [3]. CELESTIAL also does not support embedding EVM assembly. To check the prevalence of these features in real-world contracts, we performed an empirical study. In summary, we found that not more than 45% of highly used and highly valued contracts use these features, and even then in a controlled manner where their usage is restricted to a small set of libraries.

*d) Modeling Limitations:* Our F<sup>⋆</sup> semantics does not model gas consumption. As a result, CELESTIAL contracts may revert due to out-of-gas exceptions. The model also does not cover low-level failures such as call stack depth overflow. However, these failures can only cause the transaction to revert and therefore do not compromise the verification guarantees. Since we do not model all runtime exceptions, this is one of the reasons that the tx_reverts condition for a function is an under-specification for when the function may revert. We also do not precisely model block-level parameters such as timestamp.

#### III. VERIFYING CELESTIAL CONTRACTS IN F<sup>⋆</sup>

CELESTIAL compiles the contracts and their specifications to F<sup>⋆</sup>, which are then verified against a trusted F<sup>⋆</sup> library modeling the blockchain semantics. The library consists of the definition of the blockchain state datatype and a custom F<sup>⋆</sup> *effect* that encapsulates this state behind the abstraction of an effect layer. We have carefully designed this abstraction to ensure that the verification is scalable and fully automated. The contracts call the stateful API exported by the library and specify precise changes to the blockchain state in their pre- and postconditions, which are verified by F<sup>⋆</sup>.

#### *A. Blockchain state*

We model the blockchain state as consisting of 3 main elements: (a) state of all the contracts (i.e. values of the contract fields), (b) contract balances, and (c) an event log. Since in CELESTIAL all contract fields are private, a contract can directly read or write only its own fields, while interacting with the other contracts through method calls. The event log models the per-transaction event log of the Ethereum blockchain; contracts can use the Solidity emit API to output events to this log.

*a) Contracts state:* We model the state of all the contracts in the blockchain as a heterogeneous map from addresses to records, where the record corresponding to a contract instance contains the values of all its fields. For the Item contract from Listing 2, the record type would be:

type item_t = { market : address; seller : address; price : uint }

Below is the API provided by the contract map (# parameters are implicit parameters inferred by F<sup>⋆</sup> at the call sites):

```
type address = uint (* 256-bit unsigned integers *)
val contract (a:Type) : Type (* a is the record of contract fields *)
val cmap : Type (* the heterogeneous contracts map *)

val live (#a:Type) (c:contract a) (m:cmap) : prop
val sel (#a:Type) (c:contract a) (m:cmap{live c m}) : a
val create (#a:Type) (m:cmap) (x:a) : contract a & cmap
val upd (#a:Type) (c:contract a) (m:cmap{live c m}) (x:a) : cmap
val addr_of (#a:Type) (c:contract a) : address
```
The API defines the type address as 256-bit unsigned integers. The contract type is parametric over the record type a that contains all the contract fields; for the Item contract, type a will be instantiated with item_t. Type cmap is the heterogeneous contracts map type.

The sel function returns the a-typed record value mapped to a contract instance in the map. The API requires that the contract be live in the map (type m:cmap{live c m} is a refinement type that requires that the m argument at the call sites satisfies live c m). The liveness requirement says that the contract must be present in the contracts map, preventing sel from being called with arbitrary addresses. The create function returns the freshly created contract and the new cmap that includes a mapping for the new contract, internally assigning a fresh address to the new contract. The API is fully implemented in F<sup>⋆</sup>; we elide the implementation details for space reasons. All of our development is available online at https://github.com/microsoft/verisol/tree/celestial/Celestial.
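To make the behaviour of this API concrete, the following Python sketch models the contracts map as a dictionary from addresses to records. It is not the F<sup>⋆</sup> implementation, which is elided in the text: the fresh-address scheme (largest address plus one) is an assumption, and the liveness requirement, which F<sup>⋆</sup> enforces statically through the refinement type, is checked here at run time.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple

CMap = Dict[int, Any]  # address -> record of contract fields


@dataclass(frozen=True)
class Contract:
    """Handle to a contract instance, carrying its address."""
    address: int


def live(c: Contract, m: CMap) -> bool:
    """A contract is live if the map has an entry for its address."""
    return c.address in m


def sel(c: Contract, m: CMap) -> Any:
    """Return the record of a live contract."""
    if not live(c, m):
        raise KeyError("contract is not live in this map")
    return m[c.address]


def create(m: CMap, x: Any) -> Tuple[Contract, CMap]:
    """Allocate a fresh address and return the handle with the extended map."""
    fresh = max(m, default=0) + 1  # assumed allocation scheme
    new_map = dict(m)
    new_map[fresh] = x
    return Contract(fresh), new_map


def upd(c: Contract, m: CMap, x: Any) -> CMap:
    """Return a new map in which the record of a live contract is replaced."""
    if not live(c, m):
        raise KeyError("contract is not live in this map")
    new_map = dict(m)
    new_map[c.address] = x
    return new_map


def addr_of(c: Contract) -> int:
    return c.address
```

The functions return new maps instead of mutating their argument, which matches the pure (Tot) signatures of the API: `create` and `upd` yield a cmap and leave the input map unchanged.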

*b) Contracts balance:* We model the contracts balance using a map from addresses to uint (the type of 256-bit unsigned integers). An alternative would have been to add balance as another one of the contract fields (thus maintaining them as part of the contracts map), but a separate map allows us to specify the balances for external accounts that do not have an entry in the contracts map.

*c) Event log:* The event log is a list of events, where each event records the destination address, a string for event type, and a payload (a:Type & a is a dependent tuple that packages a Type and a value of that type):

type event = { to : address; ev_typ : string; payload : (a:Type & a) }

type log = list event

With these components, the blockchain state is the following record type:

type bstate = { cmap : cmap; balances : Map.t address uint; log : log }
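As a rough analogue of the bstate record, the following Python sketch bundles the three components into one immutable value. The helper names `emit` and `balance_of`, the newest-first log order, and the default balance of 0 for unknown addresses are assumptions made for illustration; Python values carry their own type, so the dependent tuple used for the payload has no direct counterpart.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Event:
    to: int        # destination address
    ev_typ: str    # event type
    payload: Any   # event arguments


@dataclass(frozen=True)
class BState:
    cmap: Dict[int, Any] = field(default_factory=dict)      # contract records
    balances: Dict[int, int] = field(default_factory=dict)  # all accounts
    log: Tuple[Event, ...] = ()                              # newest first


def emit(s: BState, ev: Event) -> BState:
    """Return a new state with the event added at the head of the log."""
    return replace(s, log=(ev,) + s.log)


def balance_of(s: BState, addr: int) -> int:
    """Look up a balance; external accounts need no entry in cmap."""
    return s.balances.get(addr, 0)
```

Keeping balances in their own map means an address such as an external account can hold ether without appearing in `cmap`, which is the design choice motivated in the text.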

#### *B. Libraries for arrays and maps*

We have implemented F<sup>⋆</sup> libraries for modeling Solidity arrays and maps; the uses of arrays and maps in CELESTIAL contracts are translated to uses of these F<sup>⋆</sup> libraries. Our current implementation only supports dynamically-sized arrays; support for compile-time fixed-sized arrays is future work. The libraries export operations that match the corresponding Solidity API, and several lemmas that enable the contracts to reason about their properties. For example, the following is a snippet of our array library:

val array (a:Type) : Type *(\* an array with element type a \*)*

val push (#a:Type) (s:array a{length s < uint_max}) (x:a) : array a

val push_length (#a:Type) (s:array a{length s < uint_max}) (x:a) : Lemma (requires ⊤) (ensures (length (push s x) == length s + 1))
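The relation between push and the push_length lemma can be sketched in Python, where the lemma becomes a property that is tested on sample inputs instead of being proved for all inputs. Modeling arrays as tuples and the function `push_length_holds` are assumptions of this sketch, not part of the F<sup>⋆</sup> library.

```python
UINT_MAX = 2**256 - 1


def push(s: tuple, x) -> tuple:
    """Append x to the array s; requires length s < uint_max."""
    if not len(s) < UINT_MAX:
        raise ValueError("array is full")
    return s + (x,)


def push_length_holds(s: tuple, x) -> bool:
    """The statement of push_length, evaluated for one concrete input."""
    return len(push(s, x)) == len(s) + 1
```

In F<sup>⋆</sup> the length bound is a refinement on the argument type, so a call that might violate it is rejected at verification time; here it surfaces only as a run-time error.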

#### *C. An F*<sup>⋆</sup> *effect for contracts*

Having set up the model for the blockchain state, we now add a layer on top so that the contracts may manipulate the state and precisely specify the modifcations in pre- and postconditions, while making sure that the verifcation complexity does not get out-of-hands. We leverage the type-and-effect system of F<sup>⋆</sup> for this purpose.

F<sup>⋆</sup> distinguishes value types such as uint from *computation types*. Computation types specify the effect of a computation, its result type, and optionally some specifications (e.g. pre- and postconditions) for the computation. For example, Tot uint classifies pure, terminating computations that return a uint value. Similarly, uint → Tot uint is the type of pure, terminating functions that take a uint argument and return a uint result. uint → uint is a shorthand for uint → Tot uint; all the blockchain state functions that we have seen so far have an implicit Tot effect.

Following Ahman et al. [21], a state and exception effect for computations that operate on mutable state and may throw exceptions is as follows (st is the type of mutable state):

```
type result (a:Type) = (* the return type of the computations *)
  | Success : x:a → result a
  | Error : e:string → result a

effect STEXN a st (pre:st → prop) (post:st → result a → st → prop) = ...
```

The semantics of the computations in the STEXN effect may be understood as follows: a computation e of type STEXN a st pre post, when run in an initial state (s0:st) satisfying pre s0, terminates either by throwing an exception (modeled as returning an Error-valued result) or by returning a value of type a (modeled as returning a Success-valued result). In either case, the final state (s1:st) is such that post s0 r s1 holds, where r is the return value of the computation. F<sup>⋆</sup> also supports divergent effects, in which case the computations are also allowed to diverge. The STEXN effect in F<sup>⋆</sup> comes with a program logic for verifying such computations.
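The reading of pre and post given above can be mimicked dynamically in Python. The sketch below is an assumption-laden illustration, not F<sup>⋆</sup>'s program logic: a computation is modeled as a function from a state to a (result, state) pair, and the hypothetical `run_checked`, `safe_decrement`, and `decrement_post` check the contract at run time for a single execution.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Success:
    x: Any


@dataclass(frozen=True)
class Error:
    e: str


def run_checked(comp, pre, post, s0):
    """Run comp from s0 and check pre s0 and post s0 r s1 for this one run."""
    assert pre(s0), "precondition violated"
    r, s1 = comp(s0)
    assert post(s0, r, s1), "postcondition violated"
    return r, s1


def safe_decrement(s: int):
    # Example computation over an integer state: throws on underflow.
    if s == 0:
        return Error("underflow"), s
    return Success(s - 1), s - 1


def decrement_post(s0, r, s1) -> bool:
    # The postcondition covers both the exceptional and the normal outcome.
    if isinstance(r, Error):
        return s0 == 0 and s1 == s0
    return r == Success(s0 - 1) and s1 == s0 - 1
```

Note that the postcondition relates the initial state, the result, and the final state in both the Success and the Error case, as in the STEXN signature.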

*a) Customizing STEXN for contracts:* Contract computations naturally fall into the state and exception effect; they read from and write to the mutable blockchain state, and they may throw an exception by calling revert.

However, the revert operation in Ethereum is slightly different from exceptions in, say, OCaml in that it also reverts the underlying state to what it was at the beginning of the transaction, while in OCaml, the state changes are retained. To accommodate this, we instantiate the state st in STEXN with

type st = { tx_begin : bstate; current : bstate }

where the field tx_begin snapshots the state at the beginning of a transaction. Contracts modify the current state, unless they revert, in which case the current state is reset to tx_begin. Thus, we define the ETH effect for smart contracts as follows:

```
(* state + exception with st as the state *)
effect ETH (a:Type) (pre:st → prop) (post:st → result a → st → prop) =
  STEXN a st pre post
```
Using the ETH effect, we implement the APIs for begin_transaction, revert, and commit_transaction as follows:

let begin_transaction () : ETH unit (requires λ_ → ⊤) (ensures λs0 r s1 → is_success r ∧ s0 == s1) = () *(\* no op \*)*

let revert () : ETH unit (requires λ_ → ⊤) (ensures λs0 r s1 → is_err r ∧ s1=={s0 with current=s0.tx_begin}) = ...

let commit_transaction () : ETH unit (requires λ_ → ⊤) (ensures λs0 r s1 → is_success r ∧ s1=={s0 with tx_begin=s0.current}) = ...

The function begin_transaction is a no-op: its precondition is trivial (⊤), while its postcondition states that it does not revert (is_success r) and that it leaves the state unchanged (s0 == s1). revert, on the other hand, returns an error value, and its output state s1 is the same as its input state s0 with the current component replaced with the snapshot s0.tx_begin, i.e. the state at the beginning of the transaction. commit_transaction does the opposite: it replaces the tx_begin component with s0.current to commit the current state.
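The snapshot discipline can be illustrated with a small Python sketch. It assumes an opaque, immutable stand-in for bstate (plain integers in the usage below) and represents results as simple tagged tuples; none of this is the library's actual implementation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class St:
    tx_begin: object  # snapshot of the state at the start of the transaction
    current: object   # state that contract code reads and writes


def begin_transaction(s: St):
    # No-op: succeeds and leaves the state unchanged.
    return ("Success", None), s


def revert(s: St):
    # Returns an error and resets current to the snapshot.
    return ("Error", "revert"), replace(s, current=s.tx_begin)


def commit_transaction(s: St):
    # Succeeds and makes the current state the new snapshot.
    return ("Success", None), replace(s, tx_begin=s.current)
```

For example, starting from `St(tx_begin=10, current=25)`, `revert` yields `St(tx_begin=10, current=10)`, while `commit_transaction` yields `St(tx_begin=25, current=25)`.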

The function to get the current state for a contract is as follows. Note that the contract is selected from the current component of the state:

```
let get_contract (#a:Type) (c:contract a) : ETH a
  (requires λs → live c s.current.cmap)
  (ensures λs0 x s1 → x == Success (sel c s0.current.cmap) ∧
                      s0 == s1) = ...
```
Similarly, the library provides functions send to transfer balance to a contract and emit to emit an event to the event log.

To make our specifications easier to read and write, we define the following effect abbreviation:

```
effect Eth (a:Type) (pre:bstate → prop) (revert:bstate → prop)
  (post:bstate → a → bstate → prop)
  = ETH a (requires λs → pre s.current)
      (ensures λs0 r s1 →
        (revert s0.current ⟹ Error? r) ∧
        (Success? r ⟹ post s0.current (Success?.x r) s1.current))
```
The pre- and postconditions in the Eth effect are written over the current blockchain state (bstate), as opposed to over the st record. Further, the postcondition is a predicate on a value of type a; it only specifies what happens when the contract function terminates successfully. The revert predicate is a predicate on the input state which, if it holds, means that the function reverts. We find this abbreviation well-suited for our examples; providing the full flexibility of the ETH effect to programmers is of course possible.

CELESTIAL translates each contract to an F<sup>⋆</sup> module, where the contract methods are translated to F<sup>⋆</sup> functions in the Eth effect. Every function gets explicit parameters for self, sender, value in the case of payable functions, and (underspecified) block-level parameters such as timestamp; after these the function-specific parameters follow.

The F<sup>⋆</sup> precondition of each function gets to assume the liveness of the contract and the contract invariant. Since these functions can be called by arbitrary, non-verified code, we cannot expect the callers to satisfy more sophisticated preconditions. The postcondition of each function includes the liveness, the contract invariant, and other function-specific postconditions.

The translation of a function body uses the private, per-field getters and setters, also emitted by the translation. Calls to public functions of other contracts are translated to calls to corresponding functions in other F<sup>⋆</sup> modules (contracts). Library calls to arrays, maps, etc. translate to corresponding library calls in F<sup>⋆</sup>.

We make a final comment regarding the correctness of the various translations. Since the CELESTIAL source language is just Solidity with specifications, the CELESTIAL to Solidity translation is only spec erasure. The translation to F<sup>⋆</sup> is again quite systematic, and therefore amenable to auditing. Formally proving that the CELESTIAL to F<sup>⋆</sup> translation is semantics preserving is interesting and challenging future work.

#### IV. IMPLEMENTING CELESTIAL

The translators to F<sup>⋆</sup>, for specifications as well as implementation, together comprise 2300 lines of Python code. The spec-erasing translator to Solidity is about 750 lines of Python code. The blockchain model is around 1200 lines of F<sup>⋆</sup> code. We target the 0.6.8 version of the Solidity compiler for generating EVM bytecode. To aid developer experience, we have written a plugin for Visual Studio Code [16] that supports full syntax highlighting for CELESTIAL. If developers require access to the CELESTIAL specifications in the generated Solidity, we can easily tweak the CELESTIAL to Solidity translation to preserve the specifications as comments.

*Limitations:* We focused our implementation efforts on Solidity constructs used in our case studies. We currently do not support syntactic features such as inheritance, abstract contracts and tuple types. These mostly only provide syntactic sugar that should be easy to support in future versions of CELESTIAL. Our implementation currently also does not support passing arrays and structs as arguments to functions. While our implementation allows loops in contract functions, we currently do not support writing loop invariants. We also only provide weak specifications for block-level constructs (such as timestamp, number and gaslimit), transaction-level constructs (such as origin and gasprice), and functions for obtaining hashes (such as keccak256 and sha256).

*Contract Local Reasoning:* Calling external contracts can lead to *reentrant* behavior where the external contract calls back into the caller, which is often difficult to reason about. CELESTIAL disallows such behaviors by checking for *external callback freedom* (ECF) [28], [42], which states that every contract execution that contains a reentrant callback is *equivalent* to some behavior with no reentrancy. When this property holds, it is sufficient to reason about non-reentrant behaviors only: any specification over that set of behaviors will hold for all behaviors as well. Thus, ECF allows for contract-local reasoning.

```
1 contract A {
2 bool lock ;
3 function foo () public
4 tx_reverts lock
5 { if( lock ) { revert ; } ... }
6
7 function bar ( address x ) {
8 lock = true ;
9 // external call
10 x . call (...) ;
11 lock = false ;
12 ...
13 } }
```

CELESTIAL has two ways of checking for ECF; one of these must hold for each external call. The first is a lightweight syntactic check from VERX [42]. An external call is deemed ECF compliant if it is guaranteed to only be called at the end of a transaction. In other words, any public method that may transitively invoke an external call must ensure that it does not read or write to the blockchain state after the call. External calls that do not fall in this category must satisfy CELESTIAL's second check, which asserts that any callbacks made by an external call are guaranteed to revert. We explain this check using the CELESTIAL contract shown in Listing 4. There is an external call in method bar on line 10. To prevent reentrancy, the programmer uses a contract field called lock and follows the protocol that the lock will be assigned true when making an external call. Furthermore, each public method of the contract (such as foo) will revert if lock is set to *true*. It is easy to see that if the external contract tries to call back a method of A, the transaction will abort.

CELESTIAL's translation to F<sup>⋆</sup> adds a sequence of assertions preceding each external call (that does not satisfy CELESTIAL's first check). For each public method of the contract, it takes the tx_reverts condition on the method, say ϕ, and inserts assert ϕ before the external call. This will ensure that a call back to a public method is guaranteed to revert.
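The lock protocol and the inserted assertion can be simulated in Python. This is a hedged sketch, not CELESTIAL output: the `Revert` exception, the `foo_calls` counter, and the `benign` and `malicious` callees are hypothetical, and the rollback of state on revert is not modeled (the exception simply propagates out of `bar`).

```python
class Revert(Exception):
    """Stand-in for an Ethereum revert; state rollback is not modeled here."""


class A:
    def __init__(self):
        self.lock = False
        self.foo_calls = 0  # hypothetical counter, used only to observe calls

    def foo(self):
        # tx_reverts lock: every public method reverts while the lock is set.
        if self.lock:
            raise Revert("foo called while lock is set")
        self.foo_calls += 1

    def bar(self, external):
        self.lock = True
        # Analogue of the inserted 'assert phi': foo's tx_reverts condition
        # must hold right before the external call.
        assert self.lock
        external(self)  # external call, may attempt a callback
        self.lock = False


def benign(a):
    pass  # external code that does not call back


def malicious(a):
    a.foo()  # external code that attempts a reentrant callback
```

Calling `bar(malicious)` raises `Revert` from inside the callback, so the reentrant call to `foo` never takes effect, while `bar(benign)` completes and releases the lock.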

#### V. EVALUATION

We evaluate the development experience with CELESTIAL by writing verified versions of 8 Solidity smart contracts, including real-world contracts spanning crypto-currency tokens, wallets, marketplace, auctions and governance. Some of these contracts are "high-valued", holding millions of dollars of financial assets or having processed millions of transactions.

For each contract, we added detailed functional specifications. If the verification failed, we minimally modified the code in order to discharge the verification conditions. For contracts which required such modifications, we additionally measured the gas consumption overhead, using Truffle [13]. We performed our experiments using an Intel Core i7-7600U dual-core CPU, with 16GB RAM, and running Windows 10. Table I summarizes the various case studies that we performed.

Fig. 3: The AssetTransfer state machine. The dashed arrow indicates a buggy state transition.

Due to lack of space, we discuss details of 3 of the case studies here. We refer interested readers to our Technical Report [25] for a detailed discussion of all the case studies. The sources for all the case studies are available at

https://github.com/microsoft/verisol/tree/celestial/Celestial.


TABLE I: CELESTIAL case studies. We report the number of contracts in the application (#C), LOC of the original Solidity implementation (#Sol), LOC of the CELESTIAL version, divided between specification (#Spec) and implementation (#Impl), and finally the F<sup>⋆</sup> verification time (averaged over 3 runs). Benchmarks marked with \* used a safe arithmetic library, which is added towards #Impl.

#### *A. AssetTransfer*

*Application:* AssetTransfer [10] is a microbenchmark that provides a smart contract based solution for transferring assets between a buyer and a seller. The contract encodes asset transfer as a finite state machine (FSM) (Figure 3), a common design pattern [11], [39], with the different states denoting the varying stages of approval for the transfer. The contract has notions of *roles*, such as Buyer and Seller, and state transitions are guarded by appropriate roles (for example, the contract can transition from Active to OfferPlaced when the Seller invokes the MakeOffer method).

*Specifications.* Figure 3 is also the specification for this contract, that is, we must ensure that each of the contract methods respects the transitions mentioned in the FSM diagram. For example, the following is the spec for MakeOffer:

```
function MakeOffer (uint _price)
  modifies [sellingPrice, state, log]
  tx_reverts (old(state) != Active && msg.sender != Seller)
  post (state == OfferPlaced && sellingPrice == _price)
{ // implementation }
```
The spec ensures that the method makes the correct state transition (Active → OfferPlaced), and this transition can only be caused by the Seller. Interestingly, this spec failed to verify, which led us to discover two bugs in the implementation. These bugs could potentially leave the whole transfer in a frozen state. For instance, one of the bugs led to the erroneous state transition shown in Figure 3. It caused the contract to mistakenly transition to the SellerAccept state, even after both the Seller and Buyer had accepted the transfer, which makes the final state (Accept) unreachable. Fixing these bugs allowed verification to go through. Previous work [47] has noted similar bugs in a different version of the contract. The original contract also had overflow/underflow vulnerabilities, which we eliminated using runtime checks.

*Performance.* We ran both contracts (CELESTIAL-generated Solidity and original Solidity) through a typical asset-transfer workflow. On average, the CELESTIAL version consumed 1.12× more gas compared to the original. We account for both the contract as well as any associated library, for instance for safe arithmetic, when measuring the deployment cost.

#### *B. ERC20 Tokens*

*Application.* ERC20 is a standard [4] for Ethereum cryptocurrencies (or *tokens*). To date, over 400K ERC20 tokens have been deployed on Ethereum, handling financial assets worth *billions of dollars*. We formally verified the OpenZeppelin ERC20 contract [8], which is a popular reference implementation of some of the key ERC20 functions, such as transferring tokens from one account to another and approving third parties to spend tokens on a user's behalf. We also verified the ERC20-based BinanceCoin (BNB) [2] token.

*Specifications.* We based some of our specifications on earlier efforts to formally verify the OpenZeppelin ERC20 token [6], [47]. The following shows an excerpt. The implementation maintains the balance (number of issued tokens) for each contract address using a balances map. CELESTIAL allows us to easily express the important invariant (line 4) that the sum over the balances for each user equals the total number of tokens issued.

```
1 contract ERC20 {
2 mapping ( address => uint ) _balances ;
3 uint _totalSupply ; // total issued tokens
4 invariant _balanceAndSellerCredits {
5 _totalSupply = sum_mapping ( _balances )
6 }}
```
The remaining specifications capture the business logic of key ERC20 functions. The example below shows the postcondition for the transfer method that is used for atomically debiting a source account, and crediting the amount in a destination account. The postcondition ensures that the correct debit and credit operations occur in the source and destination accounts, and all other accounts remain unchanged.

```
function _transfer (address from, address to, uint amt)
  private tx_reverts ..., modifies [...]
  pre _balances[from] >= amt &&
      _balances[to] + amt <= uint_max
  post ite(from == to, _balances == old(_balances),
           _balances == old(_balances)[
             from => old(_balances)[from] - amt,
             to => old(_balances)[to] + amt])
{ // implementation }
```
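To see how the transfer postcondition and the sum invariant fit together, the following Python sketch models the balances map as a dictionary and checks both properties at run time. It is an illustrative model only: `transfer`, `transfer_post`, and the treatment of missing keys as zero balances are assumptions made for this example, and the checks cover single executions rather than all of them.

```python
UINT_MAX = 2**256 - 1


def transfer(balances: dict, frm, to, amt: int) -> dict:
    # Preconditions from the spec, checked dynamically.
    assert balances.get(frm, 0) >= amt
    assert balances.get(to, 0) + amt <= UINT_MAX
    new = dict(balances)
    new[frm] = new.get(frm, 0) - amt  # debit the source
    new[to] = new.get(to, 0) + amt    # credit the destination (reads the debited
                                      # value when frm == to, so the net is zero)
    return new


def transfer_post(old: dict, new: dict, frm, to, amt: int) -> bool:
    # ite(from == to, balances unchanged, balances updated at from and to only)
    keys = set(old) | set(new) | {frm, to}
    expected = {k: old.get(k, 0) for k in keys}
    if frm != to:
        expected[frm] -= amt
        expected[to] += amt
    return all(new.get(k, 0) == expected[k] for k in keys)


def total(balances: dict) -> int:
    # Analogue of sum_mapping(_balances)
    return sum(balances.values())
```

A transfer leaves `total` unchanged, which is why the invariant relating `_totalSupply` to the sum of balances is preserved by this operation.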
The ERC20 token makes copious use of arithmetic operations. OpenZeppelin designed a SafeMath Library [9] to perform runtime checks for overflows and underflows, which the original ERC20 token leverages to ensure runtime safety for arithmetic operations. In contrast, we used the CELESTIAL safe arithmetic operations in public functions, and eliminated runtime checks altogether in private functions when the arithmetic was provably safe.

#### *C. Governance Contract*

*Application.* We study a contract from Microsoft that manages a consortium of mutually-trusted members interacting on a *private* Ethereum blockchain. The contract comprises a set of rules governing operations such as inviting fresh members to join the consortium and adding or removing existing members. The contract is complex, since it maintains many correlated data structures, loops and access control policies, with each logical operation involving intricate changes to multiple data structures. Due to the proprietary nature of the contract, we abstain from showing code or specifications for it explicitly. We did not include several functions in the original contract, whose operations were orthogonal to the governance logic.

*Specifications.* We briefly describe some of the important properties that we proved.


We note that some of these properties are similar to those proved by Lahiri et al. [35] for a variation of an open-source governance contract [14].

#### VI. RELATED WORK

The literature on ensuring correctness of smart contracts can be classified into the following broad categories.

*Surveys and Best Practices.* There is a wealth of available material that highlights known vulnerabilities and exploits in smart contracts [22], [24], [41], [46]. These efforts have resulted in literature suggesting best coding practices for Solidity [5], [12]. CELESTIAL is inspired by these practices, for instance, by ruling out low-level instructions as well as uncontrolled reentrancy. However, the restrictions are not just for avoiding programming pitfalls, but rather to aid semantic verification.

*Testing.* Frameworks like Truffle [13] allow users to write unit and integration tests for smart contracts in JavaScript. The transactions are typically executed in an in-memory mock of the EVM, such as Ganache [7]. In addition to testing functional behaviors and finding bugs, such tests reveal useful diagnostic information such as gas consumption.

*Contract Analysis.* A large number of tools have been developed that statically analyze smart contracts (Solidity source code or EVM bytecode) to reveal various vulnerabilities. Examples include MadMax [27] (targeting vulnerabilities due to gas exceptions) and Slither [26] (for identifying security vulnerabilities). Oyente [38] leverages symbolic execution to rule out several classes of vulnerabilities. ContractFuzzer [33] offers a fuzzing based solution for identifying security bugs.

Solythesis [37] is a source-to-source Solidity compiler that instruments the Solidity code with runtime checks to enforce invariants, but specifications particular to each function cannot be expressed in this framework, and it has a significantly high gas overhead because of the runtime checks. VeriSmart [44] offers a highly precise verifier for ensuring arithmetic safety of Ethereum smart contracts, which discovers transaction invariants, but is unable to capture quantified transaction invariants. Tools like teEther [34] leverage symbolic execution to find vulnerable executions and automatically generate exploits.

Each of these tools targets a known set of vulnerabilities and offers specialized solutions for them. In contrast, CELESTIAL verifies custom specifications of contracts, relying on verification to rule out all vulnerabilities against that specification.

*Formal Verification.* VeriSol [35], [47] checks conformance between a state-machine-based workflow and the smart contract implementation, for contracts of Azure Blockchain Workbench [1]. VeriSol does not check for reentrancy; it simply assumes its absence, whereas CELESTIAL enforces it as part of the contract verification. Further, VeriSol does not model arithmetic over/underflow, or check for unsafe type casts, which were an important aspect of our case studies.

VerX [15], [42] is another formal verification tool. VerX uses a syntactic check to ensure ECF (which we use in CELESTIAL as well); however, it cannot verify that the program in Listing 4 satisfies ECF. VerX aims for automation of verification by inferring predicates in an abstraction-refinement loop. Such techniques tend to be limited in their ability to reason with quantifiers; VerX uses special built-in predicates like sum for quantified reasoning over maps. CELESTIAL, on the other hand, allows for the full power of first-order reasoning with quantifiers. VerX implements its own custom symbolic execution, whereas CELESTIAL uses a simple syntax translation to F<sup>⋆</sup> and delegates all analysis to the mature F<sup>⋆</sup> verifier. Unfortunately, the VerX tool is not openly available for further comparisons.

Some verification tools work at the level of EVM bytecode [30], [31], [40], [43], instead of at the Solidity source level. This is more precise and removes the Solidity compiler from the TCB; however, it is also more time consuming and hard to scale to the larger, complex contracts that we have evaluated in Section V. Bhargavan et al. [23] provide an approach to translate a subset of Solidity to F<sup>⋆</sup> for verification, as well as a method to decompile EVM bytecode to F<sup>⋆</sup> to check low-level properties such as establishing worst-case gas bounds for a transaction. Their work is presented as a proof-of-concept only, with limited evaluation and restricted to a small subset of the language.

#### VII. CONCLUSION

We presented CELESTIAL, a framework for developing formally verified smart contracts. CELESTIAL provides fully automated verification, using F<sup>⋆</sup>, of Solidity contracts annotated with functional correctness specifications. With the help of several real-world case studies, we conclude that formal verification can be made accessible to smart contract developers for programming high-assurance contracts. Our next steps include enriching our F<sup>⋆</sup> model of blockchain with more features and validating it using the Solidity test suite, as well as exploring proofs of cross-transaction properties.

#### REFERENCES


*Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020*, pages 438–453. ACM, 2020.


semantic framework. *J. Log. Algebraic Methods Program.*, 79(6):397– 434, 2010.


# The Civl Verifier

Bernhard Kragl *Amazon Web Services* and *IST Austria* Shaz Qadeer *Facebook*

*Abstract*—Civl is a static verifier for concurrent programs designed around the conceptual framework of layered refinement, which views the task of verifying a program as a sequence of program simplification steps each justified by its own invariant. Civl verifies a layered concurrent program that compactly expresses all the programs in this sequence and the supporting invariants. This paper presents the design and implementation of the Civl verifier.

#### I. INTRODUCTION

Correctness of critical specifications of concurrent systems rests upon invariants about the global system state. The classical approach to static verification is to represent the entire organizational structure (processes, threads, procedures, looping, branching, sequencing) of a concurrent system as a flat transition relation that encodes its operational semantics. Further reasoning is performed on this transition relation. This approach leads to massively complex invariants that are hard to specify for the programmer and difficult to verify via automated tools.

$$
\begin{array}{c}
a:\ x := n \\[4pt]
\begin{array}{l} b:\ \mathsf{acquire}(l) \\ c:\ t_1 := x \\ d:\ x := t_1 + 1 \\ e:\ \mathsf{release}(l) \end{array}
\;\Bigg\|\;
\begin{array}{l} b:\ \mathsf{acquire}(l) \\ c:\ t_2 := x \\ d:\ x := t_2 + 1 \\ e:\ \mathsf{release}(l) \end{array} \\[4pt]
f:\ \mathsf{assert}\ x = n + 2
\end{array}
$$

Fig. 1. Parallel increment (version 0).

We motivate our work using the program in Figure 1. This program starts with a single thread that initializes a global variable x to a constant n, creates two threads that run in parallel each incrementing x by 1 while holding the lock l, waits for the two threads to finish, and then asserts that x = n + 2. The goal of verification is to prove this assertion for all values of n and all executions of the program.

The classical approach to verification of concurrent programs models the verification problem in Figure 1 as a transition system shown in Figure 2, comprising an initial predicate Init, a transition predicate Next, and a safety predicate Safe. To prove that all reachable states of the transition system satisfy the predicate Safe, an inductive invariant Inv must be invented such that Init ⇒ Inv, Inv ∧ Next ⇒ Inv′, and Inv ⇒ Safe.

$$
\begin{array}{ll}
\textit{Init}: & pc = pc_1 = pc_2 = a \\[4pt]
\textit{Next}: & \phantom{\lor\ } pc = a \land pc' = pc_1' = pc_2' = b \land x' = n \land eq(l, t_1, t_2) \\
& \lor\ pc_1 = b \land pc_1' = c \land l = \# \land l' = \text{①} \land eq(pc, pc_2, x, t_1, t_2) \\
& \lor\ pc_1 = c \land pc_1' = d \land t_1' = x \land eq(pc, pc_2, l, x, t_2) \\
& \lor\ pc_1 = d \land pc_1' = e \land x' = t_1 + 1 \land eq(pc, pc_2, l, t_1, t_2) \\
& \lor\ pc_1 = e \land pc_1' = f \land l' = \# \land eq(pc, pc_2, x, t_1, t_2) \\
& \lor\ pc_2 = b \land pc_2' = c \land l = \# \land l' = \text{②} \land eq(pc, pc_1, x, t_1, t_2) \\
& \lor\ pc_2 = c \land pc_2' = d \land t_2' = x \land eq(pc, pc_1, l, x, t_1) \\
& \lor\ pc_2 = d \land pc_2' = e \land x' = t_2 + 1 \land eq(pc, pc_1, l, t_1, t_2) \\
& \lor\ pc_2 = e \land pc_2' = f \land l' = \# \land eq(pc, pc_1, x, t_1, t_2) \\
& \lor\ pc_1 = pc_2 = f \land pc' = f \land eq(pc_1, pc_2, l, x, t_1, t_2) \\[4pt]
\textit{Safe}: & (pc = f \Rightarrow x = n + 2) \land (pc_1 \in \{c, d, e\} \Rightarrow l = \text{①}) \land (pc_2 \in \{c, d, e\} \Rightarrow l = \text{②})
\end{array}
$$

Fig. 2. Transition relation of the program in Figure 1. The lock l can be either available (value #), or held by the first or second thread (values ① and ②). The predicate eq denotes unmodified variables, e.g., eq(l) means l′ = l.
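For a fixed n, the transition system of Figure 2 is small enough to explore exhaustively. The Python sketch below enumerates all reachable states and evaluates Safe on each of them. It is an illustration under stated assumptions: Init only constrains the program counters, so the initial values of l, x, t1 and t2 chosen here (lock available, all integers 0) are an assumption, as are the names `successors`, `safe`, and `reachable`; exhaustive search for one n is also not a proof for all n.

```python
from collections import deque

FREE = 0  # lock available ('#' in Figure 2); 1 and 2 mean held by thread 1 or 2


def successors(state, n):
    pc, pc1, pc2, l, x, t1, t2 = state
    out = []
    if pc == "a":                        # main thread: x := n, start both threads
        out.append(("b", "b", "b", l, n, t1, t2))
    for i in (1, 2):
        pcs, ts, nl, nx = [pc1, pc2], [t1, t2], l, x
        p = pcs[i - 1]
        if p == "b" and l == FREE:       # acquire(l)
            pcs[i - 1], nl = "c", i
        elif p == "c":                   # t_i := x
            pcs[i - 1], ts[i - 1] = "d", x
        elif p == "d":                   # x := t_i + 1
            pcs[i - 1], nx = "e", ts[i - 1] + 1
        elif p == "e":                   # release(l)
            pcs[i - 1], nl = "f", FREE
        else:
            continue
        out.append((pc, pcs[0], pcs[1], nl, nx, ts[0], ts[1]))
    if pc1 == "f" and pc2 == "f":        # both threads done: main reaches f
        out.append(("f", pc1, pc2, l, x, t1, t2))
    return out


def safe(state, n):
    pc, pc1, pc2, l, x, _, _ = state
    held = ("c", "d", "e")
    return ((pc != "f" or x == n + 2)
            and (pc1 not in held or l == 1)
            and (pc2 not in held or l == 2))


def reachable(n):
    init = ("a", "a", "a", FREE, 0, 0, 0)  # assumed initial values for l, x, t1, t2
    seen, todo = {init}, deque([init])
    while todo:
        s = todo.popleft()
        for nxt in successors(s, n):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen
```

Checking `all(safe(s, n) for s in reachable(n))` covers one value of n only; the inductive invariant discussed in the text is what generalizes the argument to every n.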

This approach is clearly problematic for several reasons. First, the encoding as a transition system flattens and eliminates the syntactic structure of the program. Forcing the programmer to think about the inductive invariant at the level of this encoding significantly reduces productivity. Second, the inductive invariant is likely to have as much case analysis as the encoded transition relation, making it even more tedious and unproductive for the programmer to specify it. For example, the inductive invariant for our example program is larger than its transition relation. This trivial parallel increment program is just the tip of the iceberg; the task of specification and verification explodes in complexity if we turn our attention to realistic implementations of large concurrent systems.

There are two broad approaches to the problem of inductive invariants for concurrent systems. One approach is automatic generation of inductive invariants [1], [2], [3] eliminating the need to specify them manually. Another approach is to specify them via annotations on the structured program itself [4], [5] reducing the cognitive burden on the programmer. Civl falls into this latter class of techniques; its contribution is to allow more proofs to be expressed on the structured program.

Civl proposes an alternative proof strategy which encourages the programmer to think in terms of a sequence of program versions that increasingly simplify the original program. Denoting the program in Figure 1 as version 0, we show three progressively simpler versions in Figure 3.

The simplification from version 0 to version 1 is based on mover types [6], [7]. Acquiring of lock l is a right mover, release of lock l is a left mover, and accesses to the shared variable x protected by the lock l are left and right movers.

This research was performed while Bernhard Kragl was at IST Austria, supported in part by the Austrian Science Fund (FWF) under grant Z211-N23 (Wittgenstein Award).


Fig. 3. Simplifying parallel increment.

Consequently, the code fragment executed by each child thread can be treated as an atomic block which executes in one step.

The simplification from version 1 to version 2 summarizes each atomic block with an atomic increment of x, while hiding global variable l and local variables t<sub>1</sub> and t<sub>2</sub>. This summarization is possible because each atomic block leaves the value of l unchanged.

Finally, the simplification from version 2 to version 3 applies mover types again. Since each atomic increment is both a left and right mover, the two parallel increments can be converted into a sequence of two increments. Version 3 can be verified trivially by constructing a sequential verification condition and using an SMT solver to discharge it.
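The intuition behind this last step, that atomic increments commute and so every interleaving of them agrees with the sequential program, can be illustrated with a small Python check. This is a sketch and not Civl's mover analysis: actions are modeled as pure functions on the value of x, and the hypothetical `all_orders_agree` compares all execution orders from a single start value.

```python
from itertools import permutations


def inc(x: int) -> int:
    # Atomic increment of x, as in version 2.
    return x + 1


def double(x: int) -> int:
    # A contrasting action that does not commute with inc.
    return 2 * x


def run(actions, x: int) -> int:
    # Execute atomic actions one after another (a sequential program).
    for action in actions:
        x = action(x)
    return x


def all_orders_agree(actions, x0: int) -> bool:
    # True if every ordering of the atomic actions yields the same final x.
    return len({run(order, x0) for order in permutations(actions)}) == 1
```

With two increments all orders agree, so the sequential version 3 has the same final value of x as the parallel version 2; replacing one increment with `double` breaks the agreement, which is the kind of situation a mover check is meant to exclude.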

There are several advantages of the Civl approach. First, the transition relation of the program is never exposed to the programmer, who specifies program versions using the familiar syntax of structured concurrent programs. Second, although an invariant may be needed to justify a program transformation in general, each invariant is simpler because it justifies only one transformation. Finally, invariants, even when they are needed, are supplied by annotating the structured program itself.

Section II presents a high-level overview of layered refinement, the collection of techniques underlying the Civl approach. Taken together, these techniques increase proof productivity by allowing the correctness argument to be expressed as a single layered concurrent program [8]. This section is targeted to an expert in the theory of concurrency verification and may be skipped on a first reading of the paper. Section III presents the modeling and specification features available to a Civl user through concrete examples.

Since the first published description of Civl [9], we have reimplemented the verifier completely. Section IV describes the current architecture of the Civl implementation as a conservative extension of the Boogie verifier.

The main contribution of Civl is a methodology supported by automated reasoning for implementing verified concurrent systems. We present two arguments that Civl improves the state of the art in constructing verified programs. First, Civl clearly allows new proofs of concurrent systems to be expressed. Second, these proofs have been accomplished on many programs by many researchers including several who were not involved in the design and implementation of Civl. Section V presents this accumulated experience.

#### II. LAYERED REFINEMENT

Civl advocates layered refinement over structured concurrent programs. Instead of proving the safety of a program in one shot, the new approach allows the programmer to specify a chain of increasingly simpler programs starting from the original program. Each link of the chain, from program P to program Q, represents a single simplification that may be viewed as an abstraction from P to Q or a refinement from Q to P. The correctness of the program is established piecemeal by focusing on the simpler invariant required for each refinement step separately. Most importantly, all the layers and the supporting invariants are specified as a structured and layered concurrent program [8], thus hiding the low-level transition relation from the programmer.

Layered concurrent programs introduce a succinct presentation for multi-layer refnement proofs, which offer two major advantages for interactive proof construction. First, through a syntax for expressing "data layering" (i.e., which variables live on which layers) and "control layering" (i.e., which operations live on which layers), it is easy for the user to write, refne, and maintain a proof outline. Second, a layered concurrent program expresses only the changes in the program from one layer to the next. Thus, layered concurrent programs can result in much smaller proofs, especially for large programs.

While traditional approaches view refnement as a mechanism to specify behavior of concurrent programs, Civl views refnement as a tactic to simplify verifcation of safety properties. Consequently, the simulation relation justifying the refnement step in Civl is computed but never revealed to the programmer who focuses only on the program layers and the connecting invariants. The viability of the layered refnement approach depends on the existence of program simplifcation tactics that are easy to use by the programmer and whose justifcation can be checked automatically. Civl incorporates a number of such tactics described below.

*Creating atomic blocks.* The Civl programming model comprises concurrently-executing and dynamically-created tasks operating over global memory, each access to which must be encapsulated inside an indivisible atomic action. Global variables model either shared memory or communication channels. Civl uses a theory of commutative atomic actions [6], [7] to create sequential code blocks that appear to execute atomically, despite accesses to global state by multiple atomic actions in the code block.

*Creating atomic actions.* An atomic code block might be internally complex, due to sequencing, branching, looping, and recursion. Civl summarizes such a code block with an atomic action that hides all the internal details in favor of a declarative specification. Thus, atomic actions in Civl are used to model both low-level execution primitives and high-level summary specifications. To support such diverse usage, an atomic action in Civl generalizes a guarded command [10] to include a specification of failure [11] (in addition to blocking or successful execution) and the creation of asynchronous activity in the form of pending asyncs [12].

*Synchronizing asynchrony.* Civl supports elimination of pending asyncs from the atomic actions in a program via a tactic known as inductive sequentialization [13]. Introduction and elimination of pending asyncs in atomic actions together enable a program simplification that provides the appearance of executing in one step a collection of atomic computations executing asynchronously. This tactic amplifies the use of commutative atomic actions to allow summarization of both synchronous and asynchronous computation.

Civl allows introduction and hiding of global and local variables to change the state representation of the program. This change often results in a program whose atomic actions become commutative and thus the other tactics mentioned above become applicable. Variable introduction is performed as part of the tactic that creates atomic blocks; calls to special atomic actions assign meaning to the introduced variables. Variable hiding is performed as part of the tactic that creates atomic actions from atomic blocks; the created atomic action does not refer to the hidden variables.

Variable introduction and hiding in Civl has two other benefits. First, variable introduction naturally allows the user to introduce an arbitrary safety specification for the program. Second, it becomes unnecessary to support the notion of ghost state present in most provers for concurrent programs. Changing the state representation of the program often addresses the need for ghost state. Also, a variable may be introduced and hidden at the same layer for those special cases when ghost state is needed purely for invariant specification.

The tactic that creates atomic actions often needs constraints on the reachable states of the program. These constraints are supplied via yield invariants [14] which are named and parameterized invariants that can be reused and suitably instantiated across multiple program locations where interference may happen. Yield invariants combine the precision and flexibility of location invariants [4] with the compactness and modularity of rely-guarantee specifications [5]. Civl supports local reasoning with permissions that are redistributed by atomic actions and otherwise passed around the program without duplication [14]. Permissions are useful in proving locally both that yield invariants are interference-free and that atomic actions satisfy desired commutativity properties.

Civl supports the verification of arbitrary safety properties. Civl's notion of correctness is that the lowest-layer program is free of assertion failures. Arbitrary safety properties are expressible as assertions because auxiliary state (e.g., history variables) can be introduced into the program in addition to program state.

The client of a system constructed with layered refinement only needs to check that the established high-level specification captures the desired property. The details of a layered proof are not trusted since they are checked by Civl. However, the introduction of auxiliary state into the system at the lowest layer, sometimes needed to express a specification, is trusted.

#### III. PROGRAMMING AND PROVING IN CIVL

In this section we illustrate the input language and the verification features of Civl. The presentation is necessarily brief and selective. Detailed documentation is available at our website civl-verifier.github.io.

Syntax. Civl is built on top of Boogie [15], a language and verifier for sequential programs. Boogie provides standard features for imperative programming such as assignments, sequencing, branching, looping, and procedures. Additionally, it provides specification features such as assert and assume statements, loop invariants, preconditions, postconditions, and axioms. The expression language of Boogie is first-order logic with built-in theories such as uninterpreted functions, integers, bitvectors, datatypes, and arrays. Civl adds the keywords async (asynchronous procedure call), par (parallel procedure call), and yield (yield point) to express concurrent behaviors. All other syntactic extensions are implemented using generic *attributes* which attach to abstract syntax tree nodes of a Boogie program. Attributes are of the form {:attr e1, e2, ...}, where attr is the attribute name and e1, e2, ... are parameter expressions of the attribute.

Atomic actions. Every access to a global variable has to be encapsulated into an atomic action. An atomic action consists of a *gate*, a one-state predicate that specifies the condition under which the action can execute or otherwise fail, and a *transition relation*, a two-state predicate that specifies the possible state updates of the action. Atomic actions are capable of specifying uniformly both low-level operations (like writing to a memory location or sending a message on a channel) and high-level operations (like acquiring a lock or reaching consensus in a distributed system). For example, the left column in Figure 4 shows atomic actions which acquire and release a lock, modeled by the global variable l. The Boogie procedures are identified as atomic actions by the :right/:left annotations which also declare their mover types; actions that are non-movers are annotated with :atomic. The action AcquireSpec blocks until l equals None() (denoting the availability of the lock) and then updates l to Some(tid) (denoting that the lock is held by the current thread with thread id tid). Conversely, ReleaseSpec asserts that the current thread holds the lock (the assert statement specifies the gate) and updates l to None().
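As an informal illustration of the three possible outcomes of an atomic action (failure, blocking, and successful execution), the following Python sketch models the two lock actions. It is not Civl syntax: the `AtomicAction` class, the `step` helper, and the encoding of `None()`/`Some(tid)` as Python `None`/an integer are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

State = Optional[int]  # models l: None means free, an int is the holder's tid


@dataclass
class AtomicAction:
    gate: Callable[[State, int], bool]                # one-state predicate
    transitions: Callable[[State, int], List[State]]  # possible successor states


def step(action: AtomicAction, l: State, tid: int) -> str:
    """Classify one execution attempt as 'fail', 'block', or 'ok:<state>'."""
    if not action.gate(l, tid):
        return "fail"    # gate violated: the action fails
    successors = action.transitions(l, tid)
    if not successors:
        return "block"   # no successor state: the action cannot execute yet
    return f"ok:{successors[0]}"


# AcquireSpec: trivial gate; `assume l == None()` blocks while the lock is held.
acquire_spec = AtomicAction(
    gate=lambda l, tid: True,
    transitions=lambda l, tid: [tid] if l is None else [],
)

# ReleaseSpec: `assert l == Some(tid)` is the gate; afterwards l is None().
release_spec = AtomicAction(
    gate=lambda l, tid: l == tid,
    transitions=lambda l, tid: [None],
)

print(step(acquire_spec, None, 1))  # ok:1
print(step(acquire_spec, 2, 1))     # block
print(step(release_spec, 2, 1))     # fail
print(step(release_spec, 1, 1))     # ok:None
```

The model separates the gate from the transition relation so that a violated gate (failure) and an empty set of successors (blocking) remain distinguishable.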

Program layers. In a Civl proof, the user explicitly organizes the program into layers using *layer annotations*. Variables and atomic actions have a *layer range*. In Figure 4, variable l is introduced at layer 1 and hidden at layer 2, and action AcquireSpec only exists at layer 2.

Concurrent computations are expressed by *yielding procedures*. The yielding procedure Acquire in Figure 4 acquires a lock by repeatedly invoking the compare-and-swap operation CAS\_b to atomically set the global Boolean variable b from false to true. A yielding procedure is subject to interference from other concurrent threads at any point during its execution. However, Acquire is declared to *refine* the atomic action AcquireSpec at layer 1. This means that Civl checks that Acquire "behaves like" AcquireSpec, and thus clients of the former can ignore the details of its implementation and instead reason with the atomic behavior of the latter. Acquire uses the global Boolean variable b, while AcquireSpec uses the global lock variable l. The connection between these two different representations is established by the *introduction action* set\_l, which sets l from None() to Some(tid) when b is set from false to true. Finally, the yielding procedure Client protects a critical section with calls to Acquire and Release and declares that it refines the action ClientSpec at layer 2.

```
// Left: atomic actions for acquiring and releasing a lock
var {:layer 1,2} l: Option Tid;
procedure {:right} {:layer 2,2}
AcquireSpec({:linear "tid"} tid: Tid)
modifies l;
{
  assume l == None();
  l := Some(tid);
}
procedure {:left} {:layer 2,2}
ReleaseSpec({:linear "tid"} tid: Tid)
modifies l;
{
  assert l == Some(tid);
  l := None();
}

// Middle: spinlock implementation
var {:layer 0,1} b: bool;
procedure {:yields} {:layer 1}
  {:refines "AcquireSpec"}
  {:yield_preserves "LockInv"}
Acquire({:layer 1}{:linear "tid"} tid: Tid)
{
  var t: bool;
  while (true)
    invariant {:layer 1}{:yields}
      {:yield_loop "LockInv"} true;
  {
    call t := CAS_b(false, true);
    if (t) {
      call set_l(Some(tid));
      break;
    }
  }
}

// Right: introduction action and client
procedure {:intro} {:layer 1}
set_l(v: Option Tid)
modifies l;
{ l := v; }
procedure {:yields} {:layer 2}
  {:refines "ClientSpec"}
  {:yield_preserves "LockInv"}
Client({:layer 1,2} {:hide}
       {:linear "tid"} tid: Tid)
{
  call Acquire(tid);
  ...
  call Release(tid);
}
procedure {:atomic} {:layer 3,3}
ClientSpec()
{ ... }
```
Fig. 4. A layered program, showing a lock implementation and its client. Left: Atomic actions for acquiring and releasing a lock. Middle: A spinlock implementation that refines the atomic action specification. Right: Introduction action for proving the lock refinement and a client of the lock.

The layer annotation of a yielding procedure denotes its *disappearing layer*. The procedure exists (with changing bodies) on all layers below and up to its disappearing layer. For example, Acquire exists on layer 0 and 1, and Client exists on layer 0, 1, and 2. Intuitively, a procedure is replaced with its refined atomic action above its disappearing layer.

Figure 4 encodes four program layers. Layer 0 is the most concrete program. It contains procedure Client which calls procedure Acquire, and Acquire implements a spinlock using calls to CAS\_b; b is the only global variable, and Client and Acquire have no input parameters. Layer 1 introduces the global variable l and the local input parameters tid, along with the introduction action set\_l (the call to set\_l does not exist at layer 0). At layer 2, Acquire is gone and the body of Client is rewritten to make calls to the actions AcquireSpec and ReleaseSpec; b is hidden and l is the only global variable. At layer 3, Client is also gone, and any potential calls to Client are replaced by its atomic summary ClientSpec; global variable l and the parameter tid do not exist anymore.

Layering provides a form of modularity. At layer 2 we do not care about how the lock is implemented, and at layer 3 we do not care that a lock was used at all. The applied proof tactics (variable introduction, variable hiding, and atomic blocks) simplify the necessary invariants on every layer.

Yield sufficiency. Civl partitions the bodies of yielding procedures into *yield-to-yield fragments*. The following code locations are *yield points*: procedure entry and exit, loop headers annotated with {:yields}, and explicit yield statements. Context switches are only considered at yield points, and the code between two yield points is a yield-to-yield fragment. At layer 1, in Acquire every loop iteration (i.e., call to CAS\_b) is a yield-to-yield fragment, and in Client there is a yield before and after every call. At layer 2, something interesting happens. The body of Client does not call any procedures anymore (the calls are to atomic actions now), and thus Client has only a single yield-to-yield fragment. Civl justifies this simplification using *reduction* [6], [7], concretely the fact that AcquireSpec is a *right mover* and ReleaseSpec is a *left mover*. In general, every yield-to-yield fragment is checked to be a sequence of right movers, followed by at most one non-mover, followed by a sequence of left movers.
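The shape required of a yield-to-yield fragment can be sketched as a regular-expression check over the mover types of the actions in the fragment. This Python fragment is an illustration only, not how Civl implements the check; the single-letter encoding and the treatment of a both-mover `B` (usable in either phase) are assumptions of the sketch.

```python
import re

# R = right mover, N = non-mover, L = left mover.
# B = both mover is an assumption of this sketch: it may appear in either phase.
FRAGMENT_SHAPE = re.compile(r"[RB]*N?[LB]*")


def is_reducible(movers: str) -> bool:
    """True if the fragment is right movers, at most one non-mover, left movers."""
    return FRAGMENT_SHAPE.fullmatch(movers) is not None


print(is_reducible("RL"))   # True: a right mover followed by a left mover
print(is_reducible("RNL"))  # True
print(is_reducible("LR"))   # False: a right mover after a left mover
print(is_reducible("NN"))   # False: two non-movers in one fragment
```

Under this encoding, a fragment that calls AcquireSpec and then ReleaseSpec corresponds to the string `"RL"`.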

Refinement. To justify the summarization of a yielding procedure at layer n by an atomic action, Civl checks that in every execution of the procedure, the effect of the refined action happens in exactly one yield-to-yield fragment and that other yield-to-yield fragments leave the layer-(n + 1) state unchanged. In Acquire, every loop iteration where CAS\_b fails leaves l unchanged, while the (final) iteration where CAS\_b succeeds also updates l to Some(tid) and thus produces the effect of AcquireSpec.
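The refinement condition for Acquire can be illustrated with a small Python model of one completed execution. The function names, the list of CAS outcomes, and the placeholder identifier for the other lock holder are hypothetical; Civl performs this check symbolically per fragment rather than on concrete traces.

```python
from typing import List, Optional, Tuple

Lock = Optional[int]          # None() or Some(tid)
Fragment = Tuple[Lock, Lock]  # value of l before and after one fragment


def acquire_fragments(cas_results: List[bool], tid: int) -> List[Fragment]:
    """One fragment per loop iteration of Acquire.

    cas_results[k] is the assumed outcome of the k-th CAS_b call; a failed
    CAS means that some other thread currently holds the lock.
    """
    other = tid + 1  # placeholder id of the other holder, for illustration
    fragments: List[Fragment] = []
    for ok in cas_results:
        if ok:
            fragments.append((None, tid))  # set_l(Some(tid)), then break
            break
        fragments.append((other, other))   # failed CAS leaves l unchanged
    return fragments


def refines_acquire_spec(fragments: List[Fragment], tid: int) -> bool:
    """Exactly one fragment changes l, and the change is AcquireSpec's effect."""
    effects = [f for f in fragments if f[0] != f[1]]
    return effects == [(None, tid)]


trace = acquire_fragments([False, False, True], tid=7)
print(trace)
print(refines_acquire_spec(trace, tid=7))
```

The two failed iterations are stuttering steps on l, and only the last fragment carries the effect of AcquireSpec.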

Invariants. Civl performs refinement checking modularly, by considering every yield-to-yield fragment in isolation. This usually requires certain properties to hold at yield points, notwithstanding any interference from other concurrent threads. Civl supports location invariants [4] and yield invariants [14], which are checked to be interference-free across all yield-to-yield fragments in the program. Yield invariants are named and parameterized invariants that can be reused and suitably instantiated across multiple yield points. The following code shows the yield invariant LockInv.

```
procedure {:yield_invariant} {:layer 1} LockInv();
requires b <==> (l != None());
```
In Acquire (Figure 4), LockInv is attached to the procedure entry and exit using the :yield\_preserves annotation, and to the loop header using the :yield\_loop annotation. We give examples of parameterized yield invariants below.

Permissions. Certain invariants, like those connecting local variables from different scopes, can be tedious to express and propagate. Civl addresses this problem using *linear permissions*. Program variables can be declared as *linear*, from which Civl calculates the *available* variables at every control location, assigns every available variable a set of *permissions*, and ensures that there is no duplication across these permission sets. Civl allows the user to customize the type of permissions and the assignment of permissions to variables.

The lock specification in Figure 4 uses linearity to express unique thread identifiers. The type declaration

type {:linear "tid"} Tid;

specifies the permissions for the *linear domain* tid to be of type Tid, the type of thread identifiers. This means that every variable that is linear under domain tid gets assigned a set of Tid values. The assignment is specified using *collector* functions. Civl uses the following default collector in the absence of a user-specified collector.

function {:linear "tid"} TidCol(x: Tid) : [Tid]bool { MapConst(false)[x := true] }

We use a map from Tid to bool to model a set. The polymorphic map constructor MapConst applied to false returns a map set to false everywhere representing an empty set. TidCol assigns linear variables of type Tid (like the input parameter tid of AcquireSpec and ReleaseSpec) the single value the variable contains as its permission. Consider an instance of AcquireSpec and an instance of ReleaseSpec with parameters tid1 and tid2, respectively. By linearity, Civl gets to assume that the multiset TidCol(tid1) ⊎ TidCol(tid2) = {tid1, tid2} does not contain any duplicates, which implies tid1 ≠ tid2. This assumption is used to show that the AcquireSpec instance commutes to the right of the ReleaseSpec instance, an important part of the proof that AcquireSpec and ReleaseSpec satisfy their mover types.
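The step from "no duplicates in the multiset" to "distinct identifiers" can be checked on a small domain with Python's `collections.Counter` as the multiset. The function names `tid_col` and `no_duplicates` are hypothetical stand-ins, and thread identifiers are modeled as small integers.

```python
from collections import Counter


def tid_col(x: int) -> Counter:
    """Model of the default collector: a Tid holds the single permission x."""
    return Counter([x])


def no_duplicates(*perm_sets: Counter) -> bool:
    """True if the multiset union of the permission sets has no duplicates."""
    total = sum(perm_sets, Counter())
    return all(count == 1 for count in total.values())


# Non-duplication of TidCol(tid1) and TidCol(tid2) implies tid1 != tid2.
for tid1 in range(4):
    for tid2 in range(4):
        if no_duplicates(tid_col(tid1), tid_col(tid2)):
            assert tid1 != tid2

print(no_duplicates(tid_col(1), tid_col(2)))  # True
print(no_duplicates(tid_col(1), tid_col(1)))  # False
```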

Figure 5 presents an example inspired by barrier synchronization to demonstrate how permissions are useful in proving invariants. The program has two global variables, barrier and count, to represent the set of identifiers inside the barrier and the number of threads outside the barrier, respectively. The atomic actions EnterBarrier and ExitBarrier encode entering and exiting the barrier by a thread, respectively. The yield invariant ThreadInv is parameterized by a thread identifier j and indicates that j is in the barrier. Typically, a thread with identifier i would enter the barrier by calling EnterBarrier(i), yield to other threads by calling ThreadInv(i), and then exit the barrier by calling ExitBarrier(i). The linearity of parameter j of ThreadInv and parameter i of ExitBarrier allows us to assume that j and i are distinct, and therefore ThreadInv is preserved by ExitBarrier. Preservation by EnterBarrier is trivial since this action only adds elements to barrier.

Permission redistribution. Now consider the following yield invariant BarrierInv that indicates that the sum of the size of barrier and count is equal to N, the total number of threads.

```
var {:layer 0,1} barrier: [Tid]bool;
var {:layer 0,1} count: int;
procedure {:atomic} {:layer 1} EnterBarrier(
  {:linear "tid"} i: Tid)
modifies barrier, count;
{
    barrier[i] := true;
    count := count - 1;
}
procedure {:atomic} {:layer 1} ExitBarrier(
  {:linear "tid"} i: Tid)
modifies barrier, count;
{
    assert barrier[i];
    barrier[i] := false;
    count := count + 1;
}
procedure {:yield_invariant} {:layer 1} ThreadInv(
  {:linear "tid"} j: Tid);
requires barrier[j];
```
Fig. 5. Using permissions to prove invariants.

procedure {:yield\_invariant} {:layer 1} BarrierInv(); requires Size(barrier) + count == N;

This invariant cannot be proved on the code in Figure 5. The action EnterBarrier does not preserve BarrierInv whenever barrier[i] already holds upon entry. This condition, of course, cannot happen in the program, since a thread only calls EnterBarrier when it is outside the barrier. But this constraint is not encoded in the current specification. An attempt to encode this constraint would be to make the global variable barrier linear. However, this strategy would force us to drop the linear annotation on parameter i of ExitBarrier which would then make ThreadInv unprovable.

To solve this programming problem, we present a more sophisticated use of permissions that depends on custom collectors and new linearity annotations on local variables. The datatype declaration

```
type {:linear "perm"} {:datatype} Perm;
function {:constructor} Left(i: Tid): Perm;
function {:constructor} Right(i: Tid): Perm;
```
specifies the permissions for a new linear domain perm. The datatype Perm has two constructors Left and Right; each constructor wraps a thread identifier to create a Perm value. The collectors for perm are shown below.

```
function {:linear "perm"} TidCol(x: Tid) : [Perm]bool
{ MapConst(false)[Left(x) := true][Right(x) := true] }
function {:linear "perm"} TidSetCol(xs: [Tid]bool)
: [Perm]bool
{ (lambda p: Perm :: is#Left(p) && xs[i#Left(p)]) }
```
The collector TidCol defines the permissions stored in a single thread identifier x as the set comprising Left(x) and Right(x). The collector TidSetCol collects the permissions in a set of thread identifiers xs by collecting Left(x) for each element x in xs. Additionally, there is the following default collector for type Perm.

```
function {:linear "perm"} PermCol(x: Perm) : [Perm]bool
{ MapConst(false)[x := true] }
```
Figure 6 shows the revised code for our example which now uses the linear domain perm throughout. The global variable barrier is linear and consequently a store of permissions. The signatures and implementation of EnterBarrier, ExitBarrier, and ThreadInv have also changed.

```
var {:layer 0,1} {:linear "perm"} barrier: [Tid]bool;
var {:layer 0,1} count: int;
procedure {:atomic} {:layer 1} EnterBarrier(
  {:linear_in "perm"} i: Tid)
returns ({:linear "perm"} p: Perm)
modifies barrier, count;
{
    barrier[i] := true;
    count := count - 1;
    p := Right(i);
}
procedure {:atomic} {:layer 1} ExitBarrier(
  {:linear_in "perm"} p: Perm, {:linear_out "perm"} i: Tid)
modifies barrier, count;
{
    assert p == Right(i) && barrier[i];
    barrier[i] := false;
    count := count + 1;
}
procedure {:yield_invariant} {:layer 1} ThreadInv(
  {:linear "perm"} p: Perm, j: Tid);
requires p == Right(j) && barrier[j];
```
Fig. 6. Permission redistribution in atomic actions.

We now present the intuition behind the revised implementation. EnterBarrier splits the permissions {Left(i), Right(i)} contained in its input parameter i into Left(i) which is put into barrier and Right(i) which is returned via the output parameter p. The linear\_in annotation on i indicates that the permissions in i are consumed by the call and are therefore unavailable afterwards. The permission p and the unavailable thread identifier i are used to call ThreadInv. Finally, when ExitBarrier is called with p and i and i is removed from barrier, the permission Left(i) is also removed from barrier. This permission becomes available to be joined with Right(i) contained in p so that the full permission set {Left(i), Right(i)} is put into i which becomes available after the call. This protocol is indicated by the linear\_in annotation on p and the linear\_out annotation on i.

This example shows that permissions can be redistributed without duplication by an atomic action among global variables and its parameters. This ability to soundly redistribute permissions allows us to compactly express and prove coordination protocols.
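The protocol of Figure 6 can be simulated concretely in Python to see why BarrierInv now holds. This is an executable model, not Civl code: the `Barrier` class and its method names are hypothetical, and the assertion in `enter` stands in for a fact that Civl derives from non-duplication of Left(i) rather than from an explicit gate.

```python
from typing import Set, Tuple

Perm = Tuple[str, int]  # ("Left", i) or ("Right", i)


class Barrier:
    def __init__(self, n: int) -> None:
        self.n = n                     # N, the total number of threads
        self.barrier: Set[int] = set()
        self.count = n                 # number of threads outside the barrier

    def enter(self, i: int) -> Perm:
        # The caller gives up {Left(i), Right(i)}; Left(i) moves into barrier.
        # If i were already in barrier, Left(i) would exist twice. Civl rules
        # this out by linearity; the model makes it an explicit assertion.
        assert i not in self.barrier, "Left(i) would be duplicated"
        self.barrier.add(i)
        self.count -= 1
        return ("Right", i)

    def exit(self, p: Perm, i: int) -> None:
        assert p == ("Right", i) and i in self.barrier  # gate of ExitBarrier
        self.barrier.remove(i)
        self.count += 1

    def barrier_inv(self) -> bool:
        return len(self.barrier) + self.count == self.n


b = Barrier(3)
p0 = b.enter(0)
p1 = b.enter(1)
assert b.barrier_inv()
b.exit(p0, 0)
assert b.barrier_inv()
print(sorted(b.barrier), b.count)
```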

Asynchrony. Asynchronous invocations—calls that create a new concurrent thread of computation without the caller waiting for the operation to complete—are challenging to specify and verify. Civl provides the inductive sequentialization [13] proof rule to sidestep the arduous task of inventing complex inductive invariants that capture all possible interleavings of an asynchronous program.

Consider the action ASYNC\_SUM in Figure 7. It uses an output variable PAs that represents *pending asyncs*, asynchronous operations that are spawned by ASYNC\_SUM but executed asynchronously at some later time. Concretely, ASYNC\_SUM creates the multiset of pending asyncs set\_of\_ADD(1, n) = {ADD(1), ADD(2), . . . , ADD(n)}, which could be refined to a procedure that asynchronously invokes ADD in a while loop.

```
procedure {:atomic}{:layer 1}{:IS "SUM","INV"}{:elim "ADD"}
ASYNC_SUM (n: int)
returns ({:pending_async "ADD"} PAs:[PA]int)
modifies x;
{
  assert n >= 0;
  PAs := set_of_ADD(1, n);
}
procedure {:atomic}{:layer 2} SUM (n: int)
modifies x;
{
  assert n >= 0;
  x := x + (n * (n+1)) div 2;
}
procedure {:left}{:layer 1} ADD (i: int)
modifies x;
{ x := x + i; }
procedure {:IS_invariant}{:layer 1} INV (n: int)
returns ({:pending_async "ADD"} PAs:[PA]int,
         {:choice} choice:PA)
modifies x;
{
  var i: int;
  assert n >= 0;
  assume 0 <= i && i <= n;
  x := x + (i * (i+1)) div 2;
  PAs := set_of_ADD(i+1, n);
  choice := ADD(i+1);
}
```

The annotations on ASYNC\_SUM tell Civl instead to convert it into SUM, by *eliminating* from it the pending asyncs to ADD using the *invariant action* INV. SUM adds to x the value n(n+1)/2, which is the cumulative effect of the asynchronous ADD operations. The key is that INV only talks about a single interleaving of the ADD operations: ADD(1); ADD(2); . . . ; ADD(n). It represents any prefix of this single interleaving as follows. It (1) nondeterministically picks i between 0 and n denoting the number of finished ADDs, (2) increases x by i(i+1)/2 to capture the effect of executing ADD(1) to ADD(i), (3) creates pending asyncs for ADD(i+1) to ADD(n), and (4) specifies that the next pending async we wish to execute in our sequential order is ADD(i+1). INV represents ASYNC\_SUM with i = 0, SUM with i = n, and the induction order from i to i+1 is specified by the user through the output variable choice. The justification for this sequential reduction is that ADD is a left mover, and thus can always be commuted to the desired location in the sequentialization.
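The base case, the final case, and the induction step of this argument can be replayed on concrete numbers. The Python sketch below is a sanity check on one instance (n = 4, an arbitrary starting value of x), not a proof and not Civl's encoding; `run` and `inv` are hypothetical helper names, and pending asyncs are modeled as a plain list of arguments to ADD.

```python
from itertools import permutations
from typing import List, Tuple


def run(x: int, pending: List[int]) -> int:
    """Execute the pending asyncs in the given order; ADD(i) does x := x + i."""
    for i in pending:
        x += i
    return x


def inv(x0: int, n: int, i: int) -> Tuple[int, List[int]]:
    """State described by INV after the first i ADDs of ADD(1); ...; ADD(n)."""
    return x0 + i * (i + 1) // 2, list(range(i + 1, n + 1))


n, x0 = 4, 10

# Every interleaving of the ADDs yields the effect of SUM.
finals = {run(x0, list(order)) for order in permutations(range(1, n + 1))}
assert finals == {x0 + n * (n + 1) // 2}

# INV with i = 0 is ASYNC_SUM; INV with i = n is SUM (no pending asyncs left).
assert inv(x0, n, 0) == (x0, [1, 2, 3, 4])
assert inv(x0, n, n) == (x0 + 10, [])

# Induction step: executing the chosen ADD(i+1) takes INV at i to INV at i+1.
for i in range(n):
    x, pending = inv(x0, n, i)
    choice = pending[0]
    assert choice == i + 1
    assert (x + choice, pending[1:]) == inv(x0, n, i + 1)
```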

#### IV. IMPLEMENTATION

Civl is implemented as a conservative extension of the Boogie verifier. The extensions to the syntax (Section III) and the verification engine do not affect ordinary Boogie programs. The Boogie verifier itself is implemented as a pipeline with a sequence of phases: parsing, type checking, verification condition generation, solver invocation, and error reporting. For every procedure, a verification condition in SMT-LIB format is passed to an SMT solver running in a separate process. If an error is discovered, a diagnostic error trace is calculated by examining the model returned by the solver.

The implementation of Civl adds two more phases into the pipeline of the Boogie verifier. Initially, the Civl attributes are parsed together with the rest of the Boogie program and the standard Boogie type checker is run. Then, the *Civl type checker* validates the Civl attributes and converts them into internal data structures. Next, the *Civl processor* compiles all proof obligations related to concurrency down to sequential Boogie procedures. Finally, the existing Boogie pipeline for converting procedures into verification conditions takes over.

Civl type checker. The type checker has three main parts.

First, a *layer analysis* [8] checks that the layer annotations are consistent. This analysis ensures that all program layers encoded by the input layered program are well-formed, e.g., that variables accessed and procedures/actions called on some layer actually exist on that layer. It also ensures the soundness of our refinement check. For example, in Figure 4 we could not refine Client at layer 1, because its callee Acquire first needs to be converted to the action AcquireSpec, which happens from layer 1 to layer 2. For sound variable introduction, only introduction actions and invariants are allowed to access global variables at their introduction layer. For example, at layer 1 only set\_l and LockInv refer to l, whereas AcquireSpec only refers to it at layer 2.

Second, a *yield sufficiency analysis* [7] checks, for each layer separately, that it is safe to consider context switches only at yield points. This check is implemented by computing a simulation relation [16] between a labeled control-flow graph and a specification automaton that encodes all sequences of mover types allowed by Lipton's reduction theorem [6]. The specification automaton is shown in panel ① of Figure 8. Panel ② shows the labeled graph for procedure Acquire at layer 1. Node n0 represents the loop head. Since the loop is yielding, the edge to the loop condition n1 is labeled Y. At n1 we either exit the loop and thus the entire procedure on the private edge to n3, or we execute the non-mover CAS\_b on the edge to n2 labeled N. At n2, corresponding to the if condition, we either execute the introduction action set\_l and break from the loop, or we loop back to the loop head n0, both of which are private edges. Panels ③ and ④ show that the calls to the yielding procedures Acquire and Release are labeled with Y at layer 1 but with the mover type of their respective refined atomic action at layer 2. For simplicity, Civl does not allow a yield-to-yield fragment that starts within a loop to wrap around the loop head, and thus checks that every loop that contains a Y edge is a yielding loop.

Third, a *linear flow analysis* [14] computes the available linear variables at each control location of a procedure, and ensures that calls to procedures, atomic actions, and yield invariants satisfy their linear interfaces. The following code snippet refers to Figure 6.

```
// i available, p unavailable
call p := EnterBarrier(i);
// i unavailable, p available
call ThreadInv(p, i);
// i unavailable, p available
call ExitBarrier(p, i);
// i available, p unavailable
```
Fig. 8. Labeled control-flow graphs for yield sufficiency analysis of Figure 4. ① Specification automaton. ② Acquire at layer 1. ③ Client at layer 1. ④ Client at layer 2.

EnterBarrier requires i to be available and consumes it, making p available in return. The unavailable i can be used in places where it is not required to be linear, in particular the calls to ThreadInv and ExitBarrier. After ExitBarrier which consumes p, variable i is available again.
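The availability bookkeeping for this straight-line snippet can be sketched in Python as follows. This is a simplified model and not Civl's analysis: it handles only a sequence of calls (no branching or loops), the `SIGS` table is a hand-written summary of the annotations in Figure 6, and the mode name `plain` is a made-up label for a parameter without a linear annotation.

```python
from typing import Dict, List, Set, Tuple

# Each parameter is (mode, kind); kind is "in" for inputs, "out" for outputs.
Signature = List[Tuple[str, str]]

SIGS: Dict[str, Signature] = {
    "EnterBarrier": [("linear_in", "in"), ("linear", "out")],
    "ThreadInv":    [("linear", "in"), ("plain", "in")],
    "ExitBarrier":  [("linear_in", "in"), ("linear_out", "in")],
}


def check_call(proc: str, args: List[str], available: Set[str]) -> Set[str]:
    """Return the available set after the call, or raise on a violation."""
    result = set(available)
    for (mode, kind), arg in zip(SIGS[proc], args):
        if kind == "out":
            result.add(arg)                    # linear output becomes available
        elif mode in ("linear", "linear_in"):
            if arg not in available:           # must be available on entry
                raise ValueError(f"{arg} unavailable at call to {proc}")
            if mode == "linear_in":
                result.discard(arg)            # consumed by the call
        elif mode == "linear_out":
            result.add(arg)                    # made available by the call
    return result


available = {"i"}
for proc, args in [("EnterBarrier", ["i", "p"]),
                   ("ThreadInv", ["p", "i"]),
                   ("ExitBarrier", ["p", "i"])]:
    available = check_call(proc, args, available)
    print(proc, sorted(available))
```

Calling ExitBarrier while p is unavailable raises an error in this model, which corresponds to a violated linear interface.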

Civl processor. To target Boogie's verifcation-condition generator, Civl eliminates layers, concurrency, and linearity from the input layered concurrent program by creating a collection of sequential *checker procedures*. There are two advantages to this approach. First, modular decomposition into checker procedures improves scalability by creating small verifcation problems. Second, verifcation failures in checker procedures are processed to create targeted error messages. In the following we explain the categories of checker procedures Civl generates. We do not have the space to present detailed encodings; we suggest that interested readers use the command-line fag -civlDesugaredFile to inspect the plain Boogie program generated by the Civl processor.

A common functionality required by multiple checker procedures is the computation of a logical transition relation from the code representation of an atomic action. For each code path, Civl computes a path constraint from its static single assignment form, and then iteratively eliminates intermediate copies of variables by finding and inlining definitions. Variables that cannot be eliminated are existentially quantified. The transition relation is the disjunction over all path formulas.

Permission redistribution among linear variables occurs through assignment, parameter passing, and mutation in atomic actions. The first two sources of redistribution are tracked by the syntactic flow analysis in the Civl type checker. For the third source, a checker procedure for each atomic action ensures that no permission duplication occurs due to its execution. This semantic check involves user-supplied collector functions. For example, the checker procedure for ExitBarrier from Figure 6 validates the postcondition

> TidSetCol(barrier) ⊎ TidCol(i) ⊆ TidSetCol(old(barrier)) ⊎ PermCol(old(p)),

stating that the permissions flowing out of the action through barrier and i must be a subset of the permissions flowing into the action through barrier and p. The resulting non-duplication guarantee among linear variables is injected into all the following checks as a free assumption.
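The shape of this non-duplication condition can be sketched in Python with multisets. The sketch assumes that each collector yields a multiset of permissions; the permission names below are placeholders and do not reflect what TidSetCol, TidCol, or PermCol actually return.

```python
from collections import Counter


def no_duplication(outgoing, incoming):
    """Check that the multiset union of outgoing permissions is contained
    in the multiset union of incoming permissions."""
    out_total = sum(outgoing, Counter())
    in_total = sum(incoming, Counter())
    return all(out_total[perm] <= in_total[perm] for perm in out_total)


# Placeholder permissions held on entry (e.g. via old(barrier) and old(p)).
incoming = [Counter(["a", "b"]), Counter(["c"])]

# Redistribution without duplication: every permission appears at most as
# often on exit as on entry.
ok_outgoing = [Counter(["a"]), Counter(["b", "c"])]

# Duplication: permission "c" leaves the action twice but entered once.
bad_outgoing = [Counter(["a", "c"]), Counter(["c"])]

print(no_duplication(ok_outgoing, incoming))
print(no_duplication(bad_outgoing, incoming))
```

Note that the condition is an inclusion and not an equality, so an action may drop permissions but may not create them.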

```
procedure CommutativityChecker(tid_1: Tid, tid_2: Tid)
requires tid_1 != tid_2; // derived from linearity
requires l == Some(tid_2); // gate of ReleaseSpec
modifies b, l;
{
  call AcquireSpec(tid_1); // inlined
  call ReleaseSpec(tid_2); // inlined
  // trans. rel. of ReleaseSpec(tid_2); AcquireSpec(tid_1)
  assert l == Some(tid_1);
}
```
Fig. 9. Commutativity checker for AcquireSpec and ReleaseSpec.

The mover type of each atomic action is verified by pairwise checks against every atomic action with an overlapping layer range. Each such check is encoded by multiple checker procedures to account for commutativity of both failing and successful behaviors. For example, the commutativity check between AcquireSpec and ReleaseSpec is shown in Figure 9. Recall that this check succeeds because the first call blocks due to the constraint we get from linearity. In addition, each left mover and introduction action is separately checked to have a failing or successful behavior from each initial state.

Invariants are verified separately for each layer n, resulting in a checker procedure for each yielding procedure with disappearing layer at least n. Civl constructs the checker procedure from the code of the yielding procedure as follows. First, calls to invariants and introduction actions at layers other than n are dropped and calls to yielding procedures with disappearing layers lower than n are rewritten to calls of their respective refined actions. Next, asynchronous and parallel calls (of which ordinary calls are a special case) are translated. An asynchronous call to a yielding procedure is translated into an assertion of the precondition of the procedure. An asynchronous call to an action is either synchronized or converted into a pending async [12]. A parallel call may contain arms that are actions, yield invariants, or yielding procedures. Each such call is rewritten into a sequence comprising calls to actions and parallel calls whose arms are either yield invariants or yielding procedures. For example, par A | P | I | B | C | Q | D with actions A, B, C and D, procedures P and Q, and invariant I, is rewritten to call A; par P | I; call B; call C; par Q; call D. All calls to atomic actions are inlined. Any parallel call remaining at this point is a yield where interference is possible. Next, each yield is instrumented to record a snapshot of the global variables immediately after the yield. This snapshot is used to assert the preservation of all invariants in the program at the end of a yield-to-yield fragment. Finally, each parallel call (with arms that are yielding procedures or yield invariants) comprising a yield is itself desugared as follows: (1) assert preconditions of yielding procedures and yield invariants, (2) havoc all global variables, (3) assume postconditions of yielding procedures and yield invariants. The soundness of this translation of concurrent code to sequential code is ensured by the yield sufficiency analysis of the Civl type checker.
A side condition for asynchronous calls forbids global state updates between an asynchronous call to a yielding procedure and the next yield point. Additionally, there are restrictions on the sequence of arms in a parallel call. For example, any left mover must occur before any right mover, and there cannot be both a yielding procedure and a non-mover in the sequence.
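The splitting of a parallel call into action calls and residual parallel calls, as in the par A | P | I | B | C | Q | D example, can be written as a short Python function. This is an illustrative sketch on strings only; the function name and the string representation are hypothetical and unrelated to Civl's internal data structures.

```python
def rewrite_parallel_call(arms, actions):
    """Split a parallel call into action calls and residual parallel calls.

    `arms` lists the arm names in order; `actions` is the set of names that
    are atomic actions. All other arms are yield invariants or yielding
    procedures and stay grouped in a parallel call.
    """
    result, group = [], []
    for arm in arms:
        if arm in actions:
            if group:  # close the pending group of non-action arms
                result.append("par " + " | ".join(group))
                group = []
            result.append("call " + arm)
        else:
            group.append(arm)
    if group:
        result.append("par " + " | ".join(group))
    return "; ".join(result)


rewritten = rewrite_parallel_call(
    ["A", "P", "I", "B", "C", "Q", "D"], actions={"A", "B", "C", "D"}
)
print(rewritten)  # call A; par P | I; call B; call C; par Q; call D
```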

At the disappearing layer n of every yielding procedure, a checker procedure verifies refinement of the specified atomic action by tracking two local Boolean variables, pc and ok, each initialized to false. The variable pc is set to true as soon as a yield-to-yield fragment modifies any layer-(n + 1) state; before any such modification it is asserted that pc is false. The variable ok is set to true as soon as a yield-to-yield fragment modifies the layer-(n + 1) state according to a transition admitted by the refined action; ok is asserted to be true when the procedure returns. Overall, we check that layer-(n + 1) state is modified at most once, and that a behavior of the refined action occurs at least once.
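The bookkeeping with pc and ok can be replayed on a concrete sequence of yield-to-yield fragments. The following Python sketch follows the prose description literally and is a simplification of the actual encoding, which is expressed as assertions inside a Boogie checker procedure. The function name, the representation of fragments as state pairs, and the example action (incrementing a counter by one) are assumptions made for illustration.

```python
def check_refinement(fragments, admitted):
    """Replay yield-to-yield fragments while tracking pc and ok.

    `fragments` is a list of (old_state, new_state) pairs over the
    layer-(n + 1) state; `admitted(old, new)` tells whether the refined
    action allows that transition. Returns None on success, else a message.
    """
    pc = ok = False
    for old, new in fragments:
        if old != new:
            if pc:  # a second modification of the layer-(n + 1) state
                return "layer-(n + 1) state modified more than once"
            pc = True
            if admitted(old, new):
                ok = True
    return None if ok else "no behavior of the refined action occurred"


# Hypothetical refined action: increment a counter by exactly one.
def increments(old, new):
    return new == old + 1


print(check_refinement([(0, 0), (0, 1), (1, 1)], increments))  # None
print(check_refinement([(0, 1), (1, 2)], increments))
print(check_refinement([(0, 5)], increments))
```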

Each invocation of the inductive sequentialization [13] rule results in a collection of checker procedures, one each for the base and conclusion case and one for the inductive step corresponding to each eliminated pending async.

#### V. EXPERIENCE

Civl has been used in many efforts to develop verified concurrent systems, both by the authors of Civl and by other researchers. These efforts include a concurrent garbage collector [9], a Paxos implementation [13], and implementations of concurrent data structures: the FastTrack data-race detector [17], Chase-Lev deque [18], and Java weakly-consistent objects [19]. Civl has also been used to prototype techniques for verification under TSO semantics [20]. Civl is fast enough to be used for interactive development. Even on our large benchmarks, verification time is a few seconds.

Our experience suggests that Civl's specification mechanisms (layering, commutativity, yield invariants) are natural for users. These features aid discovery of provable implementations by encouraging the user to think about different layers of abstraction, the primitives for each layer, and suitable organization of the reasoning technique at each layer. In addition, layers enable partitioning of work among multiple developers, each working on the proof of a particular layer with agreed-upon interfaces between layers.

We present more details about two major case studies to provide anecdotal evidence for the improvements in developing verified concurrent systems enabled by Civl.

Concurrent Garbage Collector. An author of this paper together with other researchers used Civl to develop a verified concurrent garbage collector and object allocator that improves upon the mark-and-sweep garbage collector by Dijkstra et al. [21] in two ways. First, the new collector supports more than one mutator running in parallel with the collector. Second, it requires a write barrier only on updates of heap pointers but not on root modifications. The Civl implementation is realistic, given in terms of individual CPU operations. The refined specification comprises high-level atomic actions for object allocation and access that provide the illusion of unbounded memory in which individual objects are not reused.

The proof is done via a sequence of 6 program transformations connecting 7 program layers. Layer 0 is described in terms of individual atomic CPU operations. Layer 0 → 1 introduces locks and atomic actions for read/write accesses. Layer 1 → 2 uses the locks and protected accesses to construct higher-level atomic operations that are used in the barrier synchronization algorithm for root scanning and in the mark-sweep algorithm. The collector operates in three phases: idle, mark, and sweep. Layer 2 → 3 reasons about the coordination between the collector and the mutators to make phase changes safely. The mark algorithm performs a depth-first search of the heap starting from the roots. The stack in this search comprises "gray" objects. Layer 3 → 4 changes the representation of the gray objects to a set. Layer 4 → 5 reasons about the root scanning algorithm that internally uses barrier synchronization to create an atomic action that scans all roots in one step. Reasoning about the write barrier also happens during this transformation. Layer 5 → 6 reasons about the mark-sweep algorithm using the atomic actions for scanning roots, maintaining the set of gray objects, and changing object colors. The garbage collector is hidden entirely, leaving the client with atomic actions for allocating objects, reading and writing object fields, and checking object equality.

This proof was constructed and reported in 2015 [9]. Since then, Civl has been rewritten but the proof has been maintained and improved. The current artifact is 2031 LOC and verifies in 25s on a standard Mac. The biggest improvement happened with the introduction of yield invariants [14], which reduced the verification time by a factor of 10.

Paxos. The Paxos protocol [22] establishes consensus among a set of unreliable nodes in an asynchronous network without a central coordinator. This protocol lies at the core of any system with replicated state. It is difficult to both understand and implement. The authors of this paper together with other researchers constructed a verified implementation [13] of single-decree Paxos, which establishes consensus on a single value. The verified implementation only uses primitive atomic actions, like reading or writing a single memory address, and sending or receiving a single message.

The proof is constructed via a sequence of 2 program transformations done over 3 layers. Layer 0 implements event handlers using primitive atomic actions for sending and receiving network messages, and for updates to the local state and decision variable at each Paxos node. The transformation from layer 0 to layer 1 converts event handlers to atomic actions at the granularity typically used to describe protocols in papers. At the same time, this transformation changes the state representation to make it easier to apply the next transformation. The invariant justifying this transformation simply connects the two state representations. The transformation from layer 1 to layer 2 uses inductive sequentialization [13] to create a single atomic action where consensus is reached in one step by nondeterministically setting decisions at each node consistently. The invariant justifying this transformation captures the intuition of the protocol. It has 4 conjuncts and is considerably simpler than the invariants in other published proofs of the Paxos protocol. For example, the proof [23] using Ivy has 5 other supporting invariants in addition to the 4 used in the Civl proof. The current artifact for the Civl proof is 1116 LOC and verifies in 7s on a standard Mac.

#### VI. RELATED WORK

In this section we compare Civl to other *reusable tools* that have *support for concurrency*.

TLA+ [24] and Event-B [25] are two classic tools for refinement reasoning over transition systems. Ivy [26] verifies transition systems using a restricted modeling and specification language (notably without functions and arbitrary quantification) that makes verification conditions decidable. While Ivy requires manual effort to encode distributed systems concepts in this restricted language, Civl requires manual effort to automate quantifier reasoning. Ivy also has a synchronous, reactive programming language from which it can extract asynchronous, distributed implementations [27]. This programming model, which cannot express fine-grained concurrency, can be encoded in Civl by threading a linear parameter through atomic actions and procedures. Ivy provides liveness reasoning and information hiding via modules.

Iris [28] is a Coq-based formalization of a program logic suitable for reasoning about fine-grained concurrent programs with higher-order ghost state. The focus in Iris is to clarify and simplify concurrent separation logics around a few primitive concepts in order to provide a suitable foundation for developing reasoning mechanisms for concurrent programs. Compared to Iris, Civl is less flexible but provides more automation on a programming notation that supports standard models of concurrent programming. ReLoC [29] is a logic built on top of Iris for interactively proving contextual refinement judgments.

Chalice [30] verifies monitor invariants, in addition to absence of data races and deadlocks, on a small Java-like concurrent programming language. VeriFast [31] supports separation logic specifications, resource invariants, and higher-order ghost state on concurrent C and Java programs. Prusti [32] uses the guarantees of the Rust type system to simplify the manual annotation effort. VerCors [33] builds on separation logic specifications and provides verification features for several concurrent programming idioms, e.g., based on histories and process algebra. VCC [34] is a verifier for concurrent C programs. VCC allows the programmer to construct a custom verification methodology via extensive support for the introduction of ghost types and values. Noninterference is accomplished via a network of type-level global invariants which together must satisfy certain stability and admissibility conditions. Similar to Civl, these tools use SMT solvers as the reasoning engine, exploit programmer interaction, and support modular reasoning. Civl provides features not present in these tools such as layered refinement and yield invariants.

Anchor [35], a successor to Calvin-R [36], is a lightweight verifier for a small Java-like programming language. Anchor allows the programmer to compactly specify conditional mover types for read and write accesses of shared object fields. It is less modular than Civl and other tools discussed here; inlining is used extensively to deal with procedure calls.

Armada [37] is a language and verifier that implements layers, mover types, and explicit noninterference reasoning. Armada is inspired by Civl but also supports weak memory and extensibility via new simplification tactics. While Civl represents all program layers in a single layered concurrent program, Armada connects explicitly written programs using proof scripts that invoke mechanized theorems.

#### VII. CONCLUSION

The Civl static verifier aids the development of verified concurrent systems through language-integrated proof structuring mechanisms, an array of program-simplifying proof tactics, and modular and automatable verification conditions. The modeling features provided in Civl are general; they can be specialized to many different domains by building custom linguistic support and automation. For example, it is possible to use Civl as the verification backend for domain-specific languages suitable for developing implementations of distributed protocols, concurrent data structures, or even system-level hardware implementations. Overall, Civl opens many new opportunities in the development of programming tools for concurrent systems.

Civl's capabilities to generate verification conditions for checking commutativity, refinement, and noninterference can be leveraged individually by a verifier. It is also conceivable to design a programming language that supports layering and atomic actions natively, and uses Civl as a backend for verification. This language would generate executable code from the lowest-layer program, which invokes atomic actions whose implementation is provided by the language runtime.

Our experience suggests that progress on the following important challenges should increase the applicability and usability of Civl. First, Civl's verification conditions contain quantifiers, which can result in unpredictable verification times. Domain-specific techniques for automatic quantifier instantiation or language mechanisms for conveniently specifying instances would help. Second, Civl supports linear maps [38] for reasoning about disjoint but flat memory. Extension to support reasoning about nested linear maps would make it easier to encode standard heap programming models. Third, layered programs in Civl are challenging to comprehend, edit, and refactor; tools to help with these tasks would be helpful. A module system for factoring out libraries and their layered proofs would aid the development of large verified systems.

#### REFERENCES


# Synthesizing Pareto-Optimal Interpretations for Black-Box Models

Hazem Torfah<sup>1</sup>, Shetal Shah<sup>2</sup>, Supratik Chakraborty<sup>2</sup>, S. Akshay<sup>2</sup>, Sanjit A. Seshia<sup>1</sup>

> <sup>1</sup>*University of California at Berkeley* {torfah, sseshia}@berkeley.edu <sup>2</sup>*Indian Institute of Technology Bombay* {shetals, supratik, akshayss}@cse.iitb.ac.in

*Abstract*—We present a new multi-objective optimization approach for synthesizing interpretations that "explain" the behavior of black-box machine learning models. Constructing *human-understandable* interpretations for black-box models often requires balancing conflicting objectives. A simple interpretation may be easier to understand for humans while being less precise in its predictions vis-a-vis a complex interpretation. Existing methods for synthesizing interpretations use a single objective function and are often optimized for a single class of interpretations. In contrast, we provide a more general and multi-objective synthesis framework that allows users to choose (1) the class of syntactic templates from which an interpretation should be synthesized, and (2) quantitative measures on both the correctness and explainability of an interpretation. For a given black-box, our approach yields a set of Pareto-optimal interpretations with respect to the correctness and explainability measures. We show that the underlying multi-objective optimization problem can be solved via a reduction to quantitative constraint solving, such as weighted maximum satisfiability. To demonstrate the benefits of our approach, we have applied it to synthesize interpretations for black-box neural-network classifiers. Our experiments show that there often exists a rich and varied set of choices for interpretations that are missed by existing approaches.

#### I. INTRODUCTION

Machine learning (ML) components, especially deep neural networks (DNNs), are increasingly being deployed in domains where trustworthiness and accountability are major concerns. Such domains include health care [5], automotive systems [28], finance [21], loans and mortgages [25], [33], and cyber-security [10] among others. For a system to be considered accountable and trustworthy, it is necessary to provide understandable explanations to (possibly expert) humans of why the system took specific actions/decisions in response to inputs of concern. This requires the availability of models that are human-understandable, and that also predict the outcome of different components of the system with reasonable accuracy. Laws and regulations, such as the General Data Protection Regulation (GDPR) in Europe [1], are already emerging with requirements on explainability of ML components in such systems. Unfortunately, the working of ML components like DNNs can be extremely complex to comprehend, and more so when the components are used as black boxes. Therefore, there is an urgent need for automated techniques that generate "easy-to-understand" and "targeted" interpretations of black-box ML components, with formal guarantees about tradeoffs between correctness and explainability.

Synthesizing a "good" interpretation of a black-box ML component often requires striking the right balance between correctness or accuracy of the interpretation (measured in terms of fidelity, misclassification rate of predictions etc.) and explainability or understandability (approximated by the size of the ML model – e.g., depth of decision tree/list/diagram, number and nature of predicates used, etc.). In most cases, the correctness and explainability measures are in direct conflict with each other. Thus, a simple interpretation that is easily understood by humans may disagree in its predictions with the output of a black-box ML component for many input instances, whereas an interpretation that correctly predicts the output for most input instances may be too large and unwieldy for human comprehension. This is not surprising since components like DNNs are often used to learn highly non-trivial functions for which simple models are not available. Therefore, *synthesis of interpretations for black-box ML components is inherently a multi-objective optimization problem with conflicting objectives, and Pareto optimality is the best we can hope for when synthesizing such interpretations.*

The literature contains a rich collection of techniques for synthesis of interpretations for black-box ML components (see, for example, recent surveys by [2] and [13]). Most of these approaches optimize a single correctness measure (e.g. misclassification rate on a set of samples) while systematically constraining some explainability measure (e.g. number of nodes or depth of a decision tree). Examples of such techniques include [19] wherein sparse logical formulae are synthesized, and also recent approaches to learning optimal decision trees using constraint programming [35]–[37], itemset/rulelist mining [3] and SAT-based techniques [6], [18], [27], among others. These approaches often allow efficient generation of a *single* interpretation with high correctness measure and satisfying user-provided explainability constraints. However, no formal guarantees of Pareto-optimality (w.r.t. correctness and explainability) are provided. Furthermore, these techniques do not compute the set of *all* Pareto-optimal interpretations, thereby constraining the choice of which interpretation to use for a given application.

In this paper, we present a novel multi-objective optimization approach for synthesizing Pareto-optimal interpretations of black-box ML components, using an off-the-shelf quantitative constraint solver (weighted MaxSAT solver in our case). For each problem instance, our approach yields a set of interpretations that correspond to *all* Pareto-optimal combinations of correctness and explainability measures. This contrasts sharply with earlier approaches such as [3], [6], [18], [19], [27], [35]–[37] that always yield a single interpretation, leaving the user with no choice of exploring the trade-off between correctness and explainability of alternative interpretations. Similar to existing work, we use syntactic constraints to restrict the class of interpretations over which to search. Unlike earlier approaches, however, we do not combine quantitative correctness and explainability measures into a single optimization objective. Any such mapping of an inherently multi-dimensional optimization problem to the uni-dimensional case results in exclusion of some Pareto-optimal solutions in general. Given that quantitative explainability measures are often just approximations of subjective preferences of the end-user, we believe it is important to present the entire set of Pareto-optimal interpretations, and leave the choice of the "best" interpretation to the user. As our experiments show, there is significant diversity among Pareto-optimal interpretations, and a user aware of this diversity can make an informed choice for a specific application.

The syntactic constraints considered in this paper restrict the space of interpretations to decision diagrams (a generalization of decision trees) with specified bounds on the number of nodes, predicates and branching factors. For simplicity, we let the set of predicates be pre-determined but potentially large, and with possibly different relative preferences for different predicates. We focus on the setting where the black-box ML model can only be treated as an input-output oracle, i.e., given an input, we can observe its output and nothing else. Additionally, we do not have access to training or test data used to create the black-box component. Our correctness measure is therefore based on querying the black-box component with random samples chosen from its input space, where the sample set size is carefully chosen to provide statistical guarantees of near-optimality. Our explainability measure takes into account user preferences of predicates and also size of the interpretation, preferring smaller interpretations over larger ones. The overall framework is, however, general enough to admit other syntactic classes (beyond decision diagrams), and also other correctness and explainability measures.

We have implemented our approach in a prototype tool and applied it to synthesize Pareto-optimal interpretations for some black-box neural network classifiers. Our results exhibit the richness of choices available to the end-user in each case, none of which would be exposed by existing methods that generate only a single optimal interpretation. Indeed, we find that significant improvements in explainability can sometimes be achieved by only a marginal reduction of accuracy.

Our primary contributions can be summarized as follows:


problem, for some meaningful choices of correctness and explainability scores.


#### II. MOTIVATING EXAMPLE

We start with an example, adapted from [11], that illustrates the diversity that exists among Pareto-optimal interpretations of black-box ML models. Consider a scenario where an airplane uses a neural network to autonomously taxi along a runway, relying on a camera sensor. Suppose the plane is expected to follow the runway centerline within a tolerance of 2.5 meters. The airplane is equipped with monitoring modules that decide under what circumstances certain learning-enabled components can be trusted to behave correctly. One of these monitoring modules decides under what conditions the camera-based perception module, which determines the distance to the centerline, can be trusted to deliver the right values. For example, the monitoring module may use the weather condition, time of day, and initial positioning of the airplane to decide whether the perception module's output is reliable. We wish to reason about this black-box monitoring module, and hence need an understandable interpretation for it.

Given a set of user-defined predicates (viz. clouds, time of day, and initial position of the plane), the user may favor certain predicates over others, and also favor concise interpretations. By giving favorability weights to each predicate, we can define an explainability score that is related to the number of nodes in the interpretation and also to the predicates used (this is detailed later). The prediction accuracy of an interpretation is measured w.r.t. a set of examples sampled from the black box, and is represented by a correctness score. Our approach explores the space of interpretations, searching for concise interpretations that use more favored predicates and also have high accuracy. Clearly, to find a "good" interpretation that meets these conflicting goals, one must explore *all* Pareto-optimal interpretations w.r.t. the criteria above.

Figure 1 shows three of the many Pareto-optimal interpretations our approach synthesized for the monitoring black-box. Each of these has its own pros and cons, and is incomparable with the others. The user can now choose the interpretation that best suits the user's purpose. For example, if interpretation size is not of concern but accuracy is, then Figure 1(b) is the best choice. However, if the user wants concise models with favored predicates (related to time of day and initial position), then Figure 1(a) is the best choice. The user may also choose the interpretation in Figure 1(c), which is only slightly less accurate than that in Figure 1(b), but has a higher explainability score. In fact, Figure 1(c) represents a healthy balance between accuracy and explainability. According to it, the perception module can be trusted only during morning hours if the plane starts no more than 2.5m from the centerline, or at any time if the plane starts within 0.5m of the centerline.

(a) Pareto-optimal interpretation with correctness measure c = 0.61 and explainability measure e = 0.95

(b) Pareto-optimal interpretation with correctness measure c = 0.94 and explainability measure e = 0.71

(c) Pareto-optimal interpretation with correctness measure c = 0.90 and explainability measure e = 0.89

Fig. 1. Pareto-optimal decision diagram interpretations for the black-box monitoring component that decides based on time of day, cloud types, and initial position of an airplane whether to trust a perception module to help the plane track the centerline of a runway. The correctness score is given by the prediction accuracy w.r.t. the sample set used. The explainability score is the normalized sum of weights of used predicates and unused nodes.

Tools that use a single-objective function to synthesize interpretations can only find one of these Pareto-optimal interpretations, depending on the relative weights given to accuracy and explainability. The rich diversity among Pareto-optimal interpretations is completely missed by such tools, effectively restricting the user's choice of a "good" interpretation.
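A small numerical experiment illustrates how a weighted-sum objective can miss Pareto-optimal points. In the Python sketch below, the pairs a, b, and c are the (correctness, explainability) measures reported in Figure 1, while d is a hypothetical fourth interpretation added only for illustration. The point d is not dominated by any of the others, yet it lies below the line connecting a and c, so no choice of weight makes it the winner of a linear combination of the two measures.

```python
# (correctness, explainability) pairs. a, b, c are taken from Figure 1;
# d is a HYPOTHETICAL additional Pareto-optimal point.
POINTS = {
    "a": (0.61, 0.95),
    "b": (0.94, 0.71),
    "c": (0.90, 0.89),
    "d": (0.75, 0.91),
}


def weighted_sum_winners(points, steps=1000):
    """Collect every point that maximizes w*c + (1-w)*e for some weight w."""
    winners = set()
    for k in range(steps + 1):
        w = k / steps
        best = max(points, key=lambda n: w * points[n][0] + (1 - w) * points[n][1])
        winners.add(best)
    return winners


print(sorted(weighted_sum_winners(POINTS)))  # d never appears
```

A multi-objective search that enumerates all maximal pairs would report d alongside the other three.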

#### III. PARETO-OPTIMAL INTERPRETATION SYNTHESIS

In this section, we formalize the Pareto-optimal interpretation synthesis problem and present a solution (for specific choices of correctness and explainability scores) using a quantitative constraint satisfaction engine. In our case, this engine is an off-the-shelf weighted maximum satisfiability solver. The key idea is that the user sets syntactic restrictions on the class of considered interpretations as well as quantitative objectives for evaluating the interpretations. The quantitative objectives are defined using two inherently incomparable measures – the explainability measure and the correctness measure. The explainability measure relates to "ease" of understanding of the interpretation by an end-user, while the correctness measure relates to how precisely the interpretation explains the behavior of the black-box model on a given set of samples. Examples of quantitative correctness measures include accuracy, recall, precision, F1-score [34], while examples of explainability measures include those that reward usage of concise interpretations and less complex predicates.

Since our access to the black-box model is only via input/output samples, the correctness measure referred to above is defined with respect to a set of samples, and not with respect to the black-box model in its entirety. While this may appear ad-hoc at first sight, we show in Section IV that rigorous statistical guarantees can indeed be provided with sufficiently many samples.

#### *A. Formal problem definition*

We now give a formal definition of the Pareto-optimal interpretation synthesis problem. An interpretation is simply a syntactic structure, viz. decision tree, decision diagram, linear model, etc. We will fix a class of interpretations $\mathcal{E}$ over an input domain $I$ and output domain $O$. For an interpretation $E \in \mathcal{E}$, we define $f_E \in (I \to O)$ to be the semantic function that is computed by $E$. Note that different interpretations may compute the same semantic function.

Every interpretation $E \in \mathcal{E}$ is associated with a pair of real-valued measures $(c, e)$, where $c$ is the correctness measure and $e$ is the explainability measure of $E$. We define a partial order $\preceq$ on such pairs as: $(c, e) \preceq (c', e')$ iff $c \le c'$ and $e \le e'$. Given a set $X$ of $(c, e)$ pairs, we define $\max X$ to be the set of $\preceq$-maximal pairs in $X$. An interpretation $E$ with the pair of measures $(c, e)$ is said to be *Pareto-optimal* if $(c, e)$ is maximal over pairs of measures of all interpretations.
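The order on measure pairs and the set of maximal pairs translate directly into code. The Python sketch below uses hypothetical function names; the first three pairs in the example are the measures reported in Figure 1, and the fourth is an invented dominated pair.

```python
def precedes(p, q):
    """Return True if p = (c, e) is below q = (c', e') in the partial order,
    i.e. c <= c' and e <= e'."""
    return p[0] <= q[0] and p[1] <= q[1]


def maximal_pairs(pairs):
    """Return the pairs that are not strictly below any other pair."""
    pairs = set(pairs)
    return {p for p in pairs if not any(p != q and precedes(p, q) for q in pairs)}


measures = {(0.61, 0.95), (0.94, 0.71), (0.90, 0.89), (0.85, 0.80)}
print(sorted(maximal_pairs(measures)))
```

Here the pair (0.85, 0.80) is removed because it is below (0.90, 0.89) in both coordinates, while the remaining three pairs are pairwise incomparable.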

*Definition 1 (Pareto-optimal interpretation synthesis):* Let $\mathcal{E}$ be a syntactic class of interpretations over inputs $I$ and outputs $O$. Further, let $S \subseteq I \times O$ be a set of samples, $\Delta_C : (I \to O) \times 2^{(I \times O)} \to \mathbb{R}^{\ge 0}$ be a correctness measure, and $\Delta_E : \mathcal{E} \to \mathbb{R}^{\ge 0}$ an explainability measure. The Pareto-optimal interpretation synthesis problem $\langle \mathcal{E}, S, \Delta_C, \Delta_E \rangle$ is the multi-objective problem of finding a Pareto-optimal interpretation $E \in \arg\max_{E' \in \mathcal{E}} (\Delta_C(f_{E'}, S), \Delta_E(E'))$.

We interpret $\Delta_C(f_E, S)$ as a measure of closeness between the semantic function $f_E$ of interpretation $E$ and the semantic constraints defined by a set $S$ of samples. An optimally correct interpretation is one with maximal closeness. An example of such a measure is the *prediction accuracy* $\frac{|\{(i,o) \in S \mid f_E(i) = o\}|}{|S|}$. The problem can also be defined in terms of the "distance" between an interpretation and the semantic constraints defined by $S$, in which case the optimization problem is one of minimization. An example of such a measure is the *misclassification rate*, which is one minus the prediction accuracy. Similarly, for $\Delta_E(\cdot)$, we choose to define it as a reward function that we want to maximize, but it can also be dually defined as a cost function we want to minimize.
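The two sample-based measures mentioned above can be illustrated with a minimal Python sketch. The function names, the toy semantic function `f_E`, and the sample set `S` are hypothetical and chosen only for illustration.

```python
def prediction_accuracy(f, samples):
    """Fraction of samples (i, o) on which the semantic function f returns o."""
    if not samples:
        raise ValueError("sample set must be non-empty")
    agree = sum(1 for i, o in samples if f(i) == o)
    return agree / len(samples)


def misclassification_rate(f, samples):
    """Dual 'distance' view: one minus the prediction accuracy."""
    return 1.0 - prediction_accuracy(f, samples)


# Hypothetical semantic function: alert exactly when the hour is before noon.
def f_E(hour):
    return "alert" if hour < 12 else "no alert"


# Hypothetical input/output samples drawn from some black-box model.
S = [(9, "alert"), (10, "alert"), (13, "no alert"), (15, "alert")]

accuracy = prediction_accuracy(f_E, S)
error = misclassification_rate(f_E, S)
```

Here `f_E` agrees with the black-box on three of the four samples, so maximizing `accuracy` and minimizing `error` select the same interpretations.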

For each $\preceq$-maximal pair of measures, there can be multiple corresponding interpretations realizing the measures. We do not distinguish between them for purposes of this paper. The following definition is therefore relevant.

*Definition 2 (Minimal representative set):* A set $\Gamma$ of Pareto-optimal interpretations is a minimal representative set for $\langle \mathcal{E}, S, \Delta_C, \Delta_E \rangle$ if for every $(c, e) \in \max_{E \in \mathcal{E}} (\Delta_C(f_E, S), \Delta_E(E))$, there is exactly one interpretation $E' \in \Gamma$ such that $(\Delta_C(f_{E'}, S), \Delta_E(E')) = (c, e)$. Our goal can therefore be stated as one of finding a minimal representative set of interpretations for a black-box model.
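To make the partial order and Definition 2 concrete, the following sketch computes the maximal pairs and one representative per maximal pair by brute force over an explicitly listed, finite set of interpretations. The interpretation names and their measures are hypothetical; the synthesis procedure in this paper does not enumerate interpretations this way.

```python
def dominates(a, b):
    """True if pair a lies strictly above pair b in the component-wise order."""
    return a != b and a[0] >= b[0] and a[1] >= b[1]


def maximal_pairs(pairs):
    """Set of pairs that are not dominated by any other pair."""
    pairs = set(pairs)
    return {p for p in pairs if not any(dominates(q, p) for q in pairs)}


def minimal_representative_set(measures):
    """measures maps an interpretation name to its (c, e) pair.

    Returns a dict with exactly one interpretation name per maximal pair.
    """
    front = maximal_pairs(measures.values())
    representatives = {}
    for name, pair in measures.items():
        if pair in front and pair not in representatives:
            representatives[pair] = name
    return representatives


# Hypothetical measures; E1 and E2 realize the same pair, E4 is dominated by E3.
measures = {
    "E1": (0.9, 0.2),
    "E2": (0.9, 0.2),
    "E3": (0.7, 0.6),
    "E4": (0.6, 0.5),
    "E5": (0.4, 0.9),
}
gamma = minimal_representative_set(measures)
```

In this toy instance only one of `E1` and `E2` is kept, which matches the convention that interpretations with identical measures are not distinguished.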

#### *B. Synthesis via weighted maximum satisfiability*

We now discuss how to synthesize one (of possibly many) Pareto-optimal interpretations for specific choices of $\mathcal{E}$, $\Delta_C$ and $\Delta_E$, by encoding the synthesis problem as a *weighted maximum satisfiability* problem (weighted MAXSAT). For purposes of our discussion, we choose $\mathcal{E}$ to be the class of *bounded multi-valued decision diagrams*, i.e., decision diagrams with multiple branching at each node, where the branching is governed by decision predicates, and with a bound on the number of decision nodes (see, e.g., diamond nodes in Figure 1). We use prediction accuracy as the correctness measure, and define the explainability measure with weights (denoting preferences) on the predicates and on the number of used nodes. The encoding for several other classes of interpretations, such as decision trees, decision rules, etc., and for other explainability and correctness measures can be done similarly.

We start by recalling the weighted MAXSAT problem. A Boolean formula $\phi$ over variables in a set $X$ is said to be in conjunctive normal form (CNF) if $\phi$ is of the form $C_1 \wedge C_2 \wedge \cdots \wedge C_m$, where each $C_i$ is a disjunction of literals (i.e., variables or negations of variables). An assignment $\sigma : X \to \{0, 1\}$ is an assignment of truth values to variables. If a clause $C_i$ evaluates to 1 under $\sigma$, we say $\sigma$ satisfies $C_i$, denoted by $\sigma \models C_i$.

*Definition 3 (Weighted Maximum Satisfiability):* Given a Boolean formula $\phi = \bigwedge_{i=1}^{m} C_i$ in CNF and a weight function $w : \{C_1, \ldots, C_m\} \to \mathbb{R}^{\ge 0}$ that assigns a non-negative real weight to each clause, the weighted MAXSAT problem is to find an assignment $\sigma$ which maximizes $\sum_{\{C_i \mid \sigma \models C_i\}} w(C_i)$. In a variant of the above definition, the clauses in $\phi$ are partitioned into *hard* and *soft* clauses. The problem now is to find an assignment $\sigma$ that satisfies *all hard clauses* and maximizes the sum of weights of satisfied soft clauses. We use this variant for encoding our problem.
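The hard/soft variant of Definition 3 can be illustrated by an exhaustive-search sketch. This is not how a MAXSAT solver works internally; it only spells out the objective. The clause representation (lists of signed integers, as in the DIMACS convention) and the small instance are assumptions made for this example.

```python
from itertools import product


def satisfies(assignment, clause):
    """A clause is a list of non-zero ints: v is variable v, -v its negation."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


def brute_force_maxsat(num_vars, hard, soft):
    """Exhaustively solve weighted MAXSAT with hard and soft clauses.

    soft is a list of (clause, weight) pairs. Returns (best_weight, assignment),
    or None if the hard clauses are unsatisfiable.
    """
    best = None
    for bits in product([False, True], repeat=num_vars):
        assignment = {v: bits[v - 1] for v in range(1, num_vars + 1)}
        if not all(satisfies(assignment, c) for c in hard):
            continue  # hard clauses must all hold
        weight = sum(w for c, w in soft if satisfies(assignment, c))
        if best is None or weight > best[0]:
            best = (weight, assignment)
    return best


# Hypothetical instance: exactly one of x1, x2 holds (hard), three soft clauses.
hard = [[1, 2], [-1, -2]]
soft = [([1], 2.0), ([3], 1.0), ([-3, 2], 1.5)]
result = brute_force_maxsat(3, hard, soft)
```

The search is exponential in the number of variables, so it is only suitable for checking tiny encodings by hand.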

At a high level, for an instance $\langle \mathcal{E}, S, \Delta_C, \Delta_E \rangle$ of the Pareto-optimal interpretation synthesis problem, we define its encoding as a conjunction of four formulae. Specifically, $\varphi_{\langle \mathcal{E}, S, \Delta_C, \Delta_E \rangle} = \varphi_{\mathcal{E}} \wedge \varphi_S \wedge \varphi_{\Delta_C} \wedge \varphi_{\Delta_E}$ where (i) $\varphi_{\mathcal{E}}$ encodes the syntactic restrictions, i.e., bounded multi-valued decision diagrams with the permitted predicates (features and branchings) and labels; (ii) $\varphi_S$ encodes the semantic constraints, i.e., the relation between the samples in $S$ and an interpretation satisfying $\varphi_{\mathcal{E}}$; (iii) $\varphi_{\Delta_C}$ encodes the correctness measure, e.g., in case of prediction accuracy it encodes whether an interpretation agrees on a sample; and finally (iv) $\varphi_{\Delta_E}$ defines constraints that encode certain structural aspects of an interpretation, e.g., what predicates were chosen and whether a node was used. We discuss some details of these formulas below, leaving the full encoding to the long version of this paper at [31].

*a) Encoding of the interpretation class ($\varphi_{\mathcal{E}}$):* We start by discussing the encoding for our interpretation class of bounded multi-valued decision diagrams over inputs $I$ and outputs $O$. These diagrams are restricted by a finite set of decision predicates, denoted by $P$. For example, in Figure 1(a), the initial node uses the "*time of day*" predicate with branchings: {[8am-12pm], [12pm-8am]}. Let $L$ be a set of output labels, e.g., in Figure 1, we have two labels, "*alert*" and "*no alert*". An *interpretation* $E \in \mathcal{E}$ is a multi-valued decision diagram over a finite set of nodes $N$, where each internal node corresponds to a decision predicate $p \in P$ and each leaf to an output label $\ell \in L$. Outgoing transitions of a node are labelled according to the branchings of the predicate corresponding to the node. We remark that features are distinct from inputs to the black-box. For example, in the decision diagrams in Figure 1 the feature "*pos*" uses the latitude and longitude inputs to compute the initial position of the plane. Furthermore, the same predicate may appear on different nodes in the decision diagram, but not more than once along a path. For a given $P$, $L$, and a bound $n$ on the number of nodes $N$ in the decision diagram, the formula $\varphi_{\mathcal{E}}$ encodes an acyclic decision diagram of at most $n$ nodes over a set $P$ of predicates, with leaves labeled by elements of $L$.

*b) Encoding of the samples:* The formula $\varphi_S$ encodes the relation between the samples and an interpretation satisfying $\varphi_{\mathcal{E}}$. It uses an auxiliary variable $m_{(i,o)}$ for each sample $(i, o)$ in the set $S$. Logically, $m_{(i,o)}$ is set to true iff the interpretation given by a satisfying assignment of $\varphi_{\mathcal{E}}$ produces the output label $o$ when fed the input $i$. For decision diagrams, this is encoded by symbolically matching the input $i$ to a decision path in the diagram, and by comparing the value of $o$ with that of the label reached at the end of the decision path. Note that the number of these auxiliary variables grows linearly with the size of the sample set.
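The intended meaning of the matching variables can be sketched concretely, outside the Boolean encoding, by evaluating a fixed decision diagram on each sample. The data structures, the two predicates, and the diagram below are hypothetical and do not reproduce Figure 1; in the actual encoding the diagram is not fixed but is determined by the satisfying assignment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    label: str


@dataclass
class Node:
    predicate: Callable[[dict], str]          # maps an input to a branch name
    children: Dict[str, Union["Node", Leaf]]  # one child per branching


def evaluate(diagram, x):
    """Follow the decision path selected by input x and return the leaf label."""
    node = diagram
    while isinstance(node, Node):
        node = node.children[node.predicate(x)]
    return node.label


def match_variables(diagram, samples):
    """Concrete value of the matching variable for every sample (i, o)."""
    return [evaluate(diagram, i) == o for i, o in samples]


# Hypothetical predicates over hypothetical input fields.
def time_of_day(x):
    return "morning" if 8 <= x["hour"] < 12 else "rest"


def clouds(x):
    return x["clouds"]


diagram = Node(time_of_day, {
    "morning": Leaf("no alert"),
    "rest": Node(clouds, {"clear": Leaf("no alert"), "overcast": Leaf("alert")}),
})
samples = [
    ({"hour": 9, "clouds": "overcast"}, "no alert"),
    ({"hour": 15, "clouds": "overcast"}, "alert"),
    ({"hour": 20, "clouds": "clear"}, "alert"),
]
m = match_variables(diagram, samples)
```

Each entry of `m` plays the role of one auxiliary variable, so the list has exactly one entry per sample.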

*c) Encoding the correctness measure ($\varphi_{\Delta_C}$):* To encode $\Delta_C$, we add a unit soft clause (i.e., a clause with only one literal) $m_{(i,o)}$ for each sample $(i, o)$. By assigning appropriate weights to these unit clauses and by maximizing the sum of weights of satisfied clauses (see Definition 3), we obtain an interpretation that maximizes $\Delta_C$ with respect to the sample set $S$. For example, if $\Delta_C$ represents the prediction accuracy, then assigning a weight of 1 to each unit clause $m_{(i,o)}$ gives us an interpretation that agrees on a maximal number of samples in $S$. If the user is interested in interpretations that agree on certain types of samples, then higher weights should be given to these samples. More precisely, to define such measures $\Delta_C$, the user can provide a function $w : I \times O \to \mathbb{R}$ that defines these weights. For example, in the case of prediction accuracy, $w$ is the constant function 1.

*d) Encoding the explainability measure ($\varphi_{\Delta_E}$):* To encode $\Delta_E$, we add a unit soft clause $u_\gamma$ for each syntactic structure $\gamma$ of an interpretation in $\mathcal{E}$ and give it a weight according to how important $\gamma$ is. For example, in the case of decision diagrams, using some predicates may be more favorable than others. To encode this, we add unit clauses $u_{(i,p)}$ that are set to true iff predicate $p$ is used in node $i$, and assign higher weights to clauses representing favorable predicates. Moreover, predicates with fewer branches can be favored by using soft clauses with appropriate weights. To further reward the synthesis of decision diagrams with fewer nodes, we can also add a unit soft clause $u_i$ for each node $i$ that is set to true iff node $i$ is not reachable from the root node in an interpretation satisfying $\varphi_{\mathcal{E}}$, and give these clauses positive weights. In this case, by maximizing the satisfaction of these clauses, we reward the synthesis of small decision diagrams.

In our weighted MAXSAT formulation, we require that all clauses resulting from a Tseitin encoding (i.e., a transformation into CNF) of the formula $\varphi_{\langle \mathcal{E}, S, \Delta_C, \Delta_E \rangle}$, except the unit soft clauses mentioned above, be hard clauses. On feeding the above formula to a MAXSAT solver, it returns a satisfying assignment giving a concrete instantiation of the decision diagram template that maximizes the sum of weights of $m_{(i,o)}$ and $u_\gamma$ clauses.

The encoding described above is specific to a particular choice of $\mathcal{E}$, $\Delta_C$ and $\Delta_E$. However, similar encodings can be constructed for a much wider class of interpretations, and of explainability and correctness measures. In fact, most types of interpretation classes used in the literature, viz. decision trees, decision diagrams, decision lists and sets of bounded depth/size, admit encoding as Boolean formulas. In addition, if the computation of explainability and correctness measures can be encoded using arithmetic circuits of bounded bitwidth, the Pareto-optimal interpretation synthesis problem can be reduced to weighted MAXSAT by assigning appropriate weights to bits in the bit-vector representing the measures. The following theorem applies to our encoding, and to all other similar encodings referred to above.

*Theorem 1 (Pareto-optimality):* Every solution of the weighted MAXSAT problem $\varphi_{\langle \mathcal{E}, S, \Delta_C, \Delta_E \rangle}$ gives a solution for the Pareto-optimal interpretation synthesis problem $\langle \mathcal{E}, S, \Delta_C, \Delta_E \rangle$.

#### *C. Exploring the set of Pareto-optimal interpretations*

We now present an algorithm for computing a minimal representative set of Pareto-optimal interpretations. The algorithm is based on the key observation that every Pareto-optimal measure $(c, e)$ splits the space of measures into four regions, depicted in Figure 2(a): (1) a region $R_1^{c,e}$ of measures for which there exists no solution, namely, all measures $(c', e') \neq (c, e)$ with $c' \ge c$ and $e' \ge e$, since otherwise $(c, e)$ would not be Pareto-optimal; (2) a region $R_2^{c,e}$ of measures that are not Pareto-optimal, namely, all points $(c', e') \neq (c, e)$ with $c' \le c$ and $e' \le e$; (3) a region $R_3^{c,e}$ with measures of potential Pareto-optimal interpretations with better correctness measures, i.e., those with measures $(c', e')$ with $c' > c$ and $e' < e$; and lastly (4) a region $R_4^{c,e}$ with measures of potential Pareto-optimal interpretations with better explainability measures, i.e., points $(c', e')$ with $c' < c$ and $e' > e$. By synthesizing a first Pareto-optimal interpretation using the procedure from the previous section, and then dividing the search space into the corresponding regions (1)-(4), our algorithm proceeds by searching for further Pareto-optimal interpretations with better correctness in region (3) and better explainability in region (4). This process is repeated for every Pareto-optimal interpretation found by our algorithm, thus directing the search into smaller and smaller regions until no new Pareto-optimal interpretation can be found.
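The four-way split can be written down directly as a small classification function. This is only an illustrative sketch; the function name and the integer-valued measures in the usage lines are hypothetical.

```python
def region(pareto_point, other):
    """Region (1-4) of `other` relative to a Pareto-optimal point (c, e).

    Returns 0 when `other` is the Pareto-optimal point itself.
    """
    c, e = pareto_point
    c2, e2 = other
    if (c2, e2) == (c, e):
        return 0
    if c2 >= c and e2 >= e:
        return 1  # would dominate (c, e): no interpretation can realize it
    if c2 <= c and e2 <= e:
        return 2  # dominated by (c, e): not Pareto-optimal
    if c2 > c and e2 < e:
        return 3  # better correctness, worse explainability
    return 4      # remaining case: c2 < c and e2 > e


# Hypothetical Pareto-optimal measure and four probe points.
p = (5, 5)
labels = [region(p, q) for q in [(6, 5), (5, 4), (7, 3), (2, 9)]]
```

The four regions together with the point itself cover the whole space of measures, which is why the final branch needs no explicit condition.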

The exploration procedure is detailed in Algorithm 1 and illustrated in Figure 2. For $\mathcal{E}$, $S$, $\Delta_C$, and $\Delta_E$, Algorithm 1 returns a minimal representative set $\Gamma$ of interpretations for all Pareto-optimal measures. To synthesize a Pareto-optimal interpretation within a given region of measures, Algorithm 1 relies on the procedure QUINTSYNT which, given $\mathcal{E}$, $S$, $\Delta_C$, and $\Delta_E$, in addition to a lower bound $\delta_E^l$ and an upper bound $\delta_E^u$ on the explainability measure, returns a Pareto-optimal interpretation $E$ with explainability measure $e$ such that $\delta_E^l \le e \le \delta_E^u$. QUINTSYNT effectively solves an extension of the weighted MaxSAT instance defined in the last section, in which we additionally require the explainability measure to satisfy the constraints given by the lower bound $\delta_E^l$ and the upper bound $\delta_E^u$. This can be done by extending the formula $\varphi$ in the last section with a fifth conjunct $\varphi_{\delta_E^l, \delta_E^u}$. This conjunct is satisfied if the sum of weights of the used syntactic structures (e.g., in the case of decision diagrams, this is the sum of weights of the satisfied clauses $u_{(i,p)}$ and $u_i$) lies within the given bounds. We leave details of this encoding to [31], but intuitively, we encode a binary adder that sums up the weights of satisfied $u_{(i,p)}$ and $u_i$ clauses and compares the result to binary encodings of the bounds. To fix the number of bits needed to encode both the adder and the bounds, we normalize the weights to values between 0 and 1 up to a certain floating-point precision $k$. We now go further into Algorithm 1 while elaborating on why it suffices to only bound the explainability measure when exploring regions (3) and (4) depicted in Figure 2(a).

Initially, Algorithm 1 explores the entire Pareto-optimal solution space. To this end, the exploration set $W$ is initialized with the point $(0, 1, 0)$ (line 2), defining a lower bound on the explainability measure, an upper bound on the explainability measure, and a lower bound on the correctness measure, respectively. For every point $(\delta_E^l, \delta_E^u, \delta_C)$ in $W$, QUINTSYNT synthesizes a Pareto-optimal interpretation within the explainability measure bounds defined by $\delta_E^l$ and $\delta_E^u$ (line 5). If an interpretation $E$ is found with measures $c$ and $e$, i.e., $E \neq \bot$ (line 6), the algorithm further divides the search space based on the following case distinction:

• if $c > \delta_C$, then a new Pareto-optimal interpretation with measures $(c, e)$ is found and the regions $R_3^{c,e}$ and $R_4^{c,e}$, defined by the points $(\delta_E^l, \downarrow e, c)$ and $(\uparrow e, \delta_E^u, \delta_C)$, respectively, are added to $W$ (lines 9 and 10). The operators $\downarrow$ and $\uparrow$ define the predecessor and successor of the value $e$ (we assume that the values are discrete and hence the predecessor and successor exist). For example, if the interpretation synthesized by QUINTSYNT is one with measures $(c', e')$ as depicted in Figure 2(b), then the region $R_4^{c',e'}$ is captured by the point $(\uparrow e', \downarrow e, c)$. The region $R_3^{c',e'}$ is captured by $(0, \downarrow e', c')$. Notice that we do not need to include an upper bound on the correctness measure as it is already implicitly defined by the $R_1^{c,e}$ region of any Pareto-optimal point $(c, e)$. For example, in Figure 2(b) the upper bound on the correctness for region $R_4^{c',e'}$ is already captured through the fact that no Pareto-optimal solutions exist in $R_1^{c',e'}$.

• if $c \le \delta_C$, then $(c, e)$ cannot be Pareto-optimal, because we already know that there is a Pareto-optimal interpretation with measures $(\delta_C, \uparrow \delta_E^u)$. In this case, we can exclude the search in the region $R_1^{\delta_C,e}$, because if there were any Pareto-optimal interpretation with measures $(\hat{c}, \hat{e})$ in $R_1^{\delta_C,e}$, then QUINTSYNT would have found this interpretation. Thus, Algorithm 1 further prunes the search region to a smaller region defined by $(\delta_E^l, \downarrow e, \delta_C)$ (line 12). For example, if Algorithm 1 used QUINTSYNT to synthesize an interpretation from $R_4^{c',e'}$, and returned a solution with measures $(c'', e'')$ as depicted in Figure 2(c), then we can exclude the search in region $R_1^{c,e''}$ and add the region $R_3^{c,e''}$ to $W$.

Lastly, if QUINTSYNT returns no interpretation, then we can immediately exclude the searched region from further exploration and thus no new points are added to $W$ in this case. For example, as shown in Figure 2(c), if QUINTSYNT found no Pareto-optimal interpretations in $R_3^{c''',e'''}$, then this region is excluded from the search and Algorithm 1 continues with the next available point in $W$.

(a) First iteration: Exploring region defined by bounds $(0, 1, 0)$. Expand $W$ with new regions $R_3^{c,e}$ and $R_4^{c,e}$ by adding the points (0, ↓e0, c0) and (↑e0, 1, 0). No Pareto-optimal points exist in the red region.

(b) Exploring the region $R_3^{c,e}$. A new Pareto-optimal interpretation is found with measures $(c', e')$. Add the points $(0, \downarrow e', c')$ and $(\uparrow e', \downarrow e, c)$ to $W$.

Fig. 2. An illustration of Algorithm 1.

**Algorithm 1** EXPLOREPOI

Input: $\mathcal{E}$, $S$, $\Delta_C$, $\Delta_E$

Output: Minimal representative set $\Gamma$ for $\langle \mathcal{E}, S, \Delta_C, \Delta_E \rangle$

1: $\Gamma := \emptyset$

2: $W := \{(0, 1, 0)\}$

3: **while** $W \neq \emptyset$ **do**

4: $\quad (\delta_E^l, \delta_E^u, \delta_C) := \mathrm{pop}(W)$

5: $\quad (E, (c, e)) = \mathrm{QUINTSYNT}(\mathcal{E}, S, \Delta_C, \Delta_E, \delta_E^l, \delta_E^u)$

6: $\quad$ **if** $E \neq \bot$ **then**

7: $\quad\quad$ **if** $c > \delta_C$ **then**

8: $\quad\quad\quad \Gamma := \Gamma \cup \{(E, (c, e))\}$

9: $\quad\quad\quad \mathrm{push}(W, (\delta_E^l, \downarrow e, c))$

10: $\quad\quad\quad \mathrm{push}(W, (\uparrow e, \delta_E^u, \delta_C))$

11: $\quad\quad$ **else**

12: $\quad\quad\quad \mathrm{push}(W, (\delta_E^l, \downarrow e, \delta_C))$

13: $\quad\quad$ **end if**

14: $\quad$ **end if**

15: **end while**

16: **return** $\Gamma$
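The exploration loop can be sketched in Python under simplifying assumptions that are not part of the paper's implementation: interpretations are given as an explicit list of hypothetical `(name, c, e)` triples with integer measures, the explainability range is `0..e_max` (so the predecessor and successor of `e` are `e - 1` and `e + 1`), and `quint_synt` is a brute-force stand-in for QUINTSYNT that maximizes `c + e` in place of the sum of satisfied soft-clause weights.

```python
def quint_synt(candidates, lo, hi):
    """Brute-force stand-in for QUINTSYNT (an assumption of this sketch).

    Among candidates (name, c, e) with lo <= e <= hi, return one maximizing
    c + e, or None if the region contains no candidate.
    """
    feasible = [x for x in candidates if lo <= x[2] <= hi]
    if not feasible:
        return None
    return max(feasible, key=lambda x: x[1] + x[2])


def explore_poi(candidates, e_max):
    """Worklist exploration following the structure of Algorithm 1."""
    gamma = []
    work = [(0, e_max, 0)]                     # (delta_l, delta_u, delta_c)
    while work:
        lo, hi, delta_c = work.pop()
        found = quint_synt(candidates, lo, hi)
        if found is None:
            continue                           # empty region: nothing to add
        _, c, e = found
        if c > delta_c:
            gamma.append(found)                # new Pareto-optimal measure
            work.append((lo, e - 1, c))        # region 3: better correctness
            work.append((e + 1, hi, delta_c))  # region 4: better explainability
        else:
            work.append((lo, e - 1, delta_c))  # prune and keep searching below e
    return gamma


# Hypothetical interpretations: (name, correctness count, explainability score).
candidates = [
    ("A", 9, 1), ("B", 7, 4), ("C", 7, 3), ("D", 4, 6),
    ("E", 3, 5), ("F", 2, 8), ("G", 9, 1),
]
gamma = explore_poi(candidates, e_max=8)
found_measures = {(c, e) for _, c, e in gamma}
```

On this toy input the loop reports one interpretation for each of the four non-dominated measures and skips `C`, `E`, and the duplicate `G`. As in Algorithm 1, the initial correctness bound is 0, so a candidate needs `c > 0` to be reported.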

Next we show some important properties of Algorithm 1.

*Lemma 1 (Soundness):* For an instance $\langle \mathcal{E}, S, \Delta_C, \Delta_E \rangle$ of the Pareto-optimal interpretation synthesis problem, if $(E, (c, e)) \in \mathrm{EXPLOREPOI}(\mathcal{E}, S, \Delta_C, \Delta_E)$, then $(c, e) \in \max_{E' \in \mathcal{E}} (\Delta_C(f_{E'}, S), \Delta_E(E'))$.

In the rest of this section, we assume that the explainability measure takes finitely many discrete values, since its values are floating-point numbers up to a certain precision. Thus, the range of $\Delta_E$ is finite, which allows us to obtain the following results.

*Lemma 2 (Completeness):* For an instance $\langle \mathcal{E}, S, \Delta_C, \Delta_E \rangle$ of the Pareto-optimal interpretation synthesis problem, if $(c, e) \in \max_{E' \in \mathcal{E}} (\Delta_C(f_{E'}, S), \Delta_E(E'))$, then there is an interpretation $E$ with measures $(c, e)$ such that $(E, (c, e)) \in \mathrm{EXPLOREPOI}(\mathcal{E}, S, \Delta_C, \Delta_E)$.

We summarize the correctness result next which follows immediately from Lemmas 1 and 2.

*Theorem 2 (Correctness of Algorithm 1):* For a class of interpretations $\mathcal{E}$, a finite set of samples $S$, and measures $\Delta_C$ and $\Delta_E$, the algorithm EXPLOREPOI terminates and returns a minimal representative set for $\langle \mathcal{E}, S, \Delta_C, \Delta_E \rangle$.

Algorithm EXPLOREPOI solves the interpretation synthesis problem as a multi-objective optimization problem. If we were to solve the same problem using single-objective optimization, it would be necessary to combine the accuracy and explainability measures for every interpretation to yield a single hybrid measure. Let $\lambda : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a function that yields such a measure. Since higher values of $c$ and $e$ always increase the desirability of an interpretation, we require $\lambda$ to be *strictly increasing*, i.e., $(c, e) \prec (c', e') \implies \lambda(c, e) < \lambda(c', e')$. For example, $\lambda(c, e) = w_1 \cdot c + w_2 \cdot e$ is a strictly increasing function for every $w_1, w_2 > 0$. Then, for any $(c, e)$ pair that is maximal with respect to such a function $\lambda$, our algorithm can find an interpretation with this measure pair. Formally,

*Theorem 3 (Universality):* For every strictly increasing function $\lambda : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ and every $\langle \mathcal{E}, S, \Delta_C, \Delta_E \rangle$, if $E \in \arg\max_{E' \in \mathcal{E}} (\lambda(\Delta_C(f_{E'}, S), \Delta_E(E')))$, then there exists an interpretation $E^\star \in \mathcal{E}$ such that (i) $\Delta_C(f_E, S) = \Delta_C(f_{E^\star}, S)$, (ii) $\Delta_E(E) = \Delta_E(E^\star)$, and (iii) $(E^\star, (\Delta_C(f_{E^\star}, S), \Delta_E(E^\star))) \in \mathrm{EXPLOREPOI}(\mathcal{E}, S, \Delta_C, \Delta_E)$.
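The relation between single-objective maximizers and the Pareto-optimal set can be checked on a small hand-made instance. The sketch below uses hypothetical measure pairs and sweeps linear functions $\lambda(c, e) = w_1 \cdot c + w_2 \cdot e$ with positive integer weights; it illustrates that every maximizer lies on the Pareto front, while a Pareto-optimal pair may still be missed by every linear choice.

```python
def pareto_front(points):
    """Pairs not dominated by any other pair in the component-wise order."""
    return {p for p in points
            if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)}


def linear_argmax(points, w1, w2):
    """All pairs maximizing the strictly increasing function w1*c + w2*e."""
    best = max(w1 * c + w2 * e for c, e in points)
    return {(c, e) for c, e in points if w1 * c + w2 * e == best}


# Hypothetical measure pairs; (3, 3) is dominated by (4, 4).
points = {(10, 0), (4, 4), (0, 10), (3, 3)}
front = pareto_front(points)

# Sweep positive integer weight pairs (w1, w2) = (a, 100 - a).
selected = set()
for a in range(1, 100):
    selected |= linear_argmax(points, a, 100 - a)
```

In this instance `(4, 4)` is Pareto-optimal but is never selected by the sweep, since `max(10 * w1, 10 * w2)` always exceeds `4 * (w1 + w2)` for positive weights. A multi-objective exploration returns it alongside the two extreme points.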

We conclude the section with some remarks on Algorithm 1.

*Remark 1:* Algorithm 1 can also be applied interactively as a conversation between synthesizer and user. Given a Pareto-optimal interpretation, the user may guide the search to interpretations that are more explainable or to those with more accuracy, until the user has found an optimal interpretation.

*Remark 2:* Note that there might be multiple interpretations with the same pair (c, e). In this case, Algorithm 1 will add only one of them as a representative interpretation, since the others are indistinguishable wrt correctness and explainability.

Finally, we can also search for Pareto-optimal solutions based on regions bounded solely by the correctness measure. We choose to use bounds on the explainability measure instead, because the sample sets tend to be large, so bounding the correctness measure would result in much larger encodings.

#### IV. STATISTICAL GUARANTEES FOR BLACK-BOX MODELS

In Section III, the correctness of an interpretation $E$, defined using a measure $\Delta_C$, was determined with respect to a set of samples $S$ obtained from the black-box model $B$. Our approach guarantees that $E$ is optimal for $S$ and the measure $\Delta_C$. Our ultimate goal, however, is to synthesize an interpretation $E$ that is optimal with respect to the entire black-box model $B$, i.e., w.r.t. the set $S_B = \{(i, o) \mid f_B(i) = o, i \in I\}$. Obtaining an exhaustive set of samples from a black-box model is often not practical. The question that we, therefore, raise in this section is: *how large must $S$ be such that it is not misleading, i.e., optimal interpretations synthesized by our approach for $S$ do not overfit the set, and thus the guarantees obtained over $S$ can be adopted for $S_B$?*

The answer to the above question lies in the theory of *Probably Approximately Correct (PAC) Learnability* [32]. The notion of a *loss function*, $\ell$, that must be minimized to obtain an optimal interpretation, is central to this discussion. For our purposes, the loss function may be viewed as $1 - \Delta_C$, where the range of the (normalized) correctness measure $\Delta_C$ is assumed to be $[0, 1]$. Thus for every $(i, o) \in I \times O$ and $f \in I \to O$, we define $\ell(f, (i, o)) = 1 - \Delta_C(f, \{(i, o)\})$. For technical reasons, we also assume that for every set $S$ of $(i, o)$ samples, we have $\Delta_C(f, S) = \frac{\sum_{(i,o) \in S} \Delta_C(f, \{(i,o)\})}{|S|}$. This is true, for example, if $\Delta_C$ is the prediction accuracy (the loss function being the misprediction rate in this case). Note that in this case, the loss function for the sample set $S$ is given by $\frac{\sum_{(i,o) \in S} \ell(f, (i,o))}{|S|} = 1 - \Delta_C(f, S)$.

A class of interpretations (or hypotheses) $\mathcal{E}$ over inputs $I$ and outputs $O$ is said to be PAC-learnable with respect to the set $Z = I \times O$ and a loss function $\ell : (I \to O) \times Z \to [0, 1]$, if there exists a function $m_{\mathcal{E}} : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the following property: For every $\epsilon, \delta \in (0, 1)$ and for every distribution $D$ over $Z$, when running the learning algorithm on $m \ge m_{\mathcal{E}}(\epsilon, \delta)$ i.i.d. samples generated by $D$, the algorithm returns a hypothesis $E$ such that, with probability (confidence) of at least $1 - \delta$, $L_D(f_E) - \min_{E' \in \mathcal{E}} L_D(f_{E'}) \le \epsilon$, where $L_D(f_E) = \mathbb{E}_{z \sim D}[\ell(f_E, z)]$. Furthermore, choosing an interpretation $E \in \mathcal{E}$ that minimizes $\frac{\sum_{z \in S} \ell(f_E, z)}{|S|}$ suffices for the learning algorithm in the above definition [32].

It is known that every finite class of interpretations is PAC-learnable due to the uniform convergence property [32]. In fact, the sample complexity, i.e., the function $m_{\mathcal{E}}$, can be determined in such cases in terms of $|\mathcal{E}|$, $\delta$ and $\epsilon$. Under the standard *realizability assumption*, i.e., assuming $\mathcal{E}$ includes an interpretation $E$ such that $f_E$ implements the semantic function $f_B$ of the black-box, $m_{\mathcal{E}}$ is bounded above by $\left\lceil \frac{\log(|\mathcal{E}|/\delta)}{\epsilon} \right\rceil$. This bound increases to $\left\lceil \frac{2 \log(2|\mathcal{E}|/\delta)}{\epsilon^2} \right\rceil$ if we do not make the realizability assumption [32].
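The two bounds are easy to evaluate numerically. The following sketch assumes the natural logarithm, and the class size of 1000 interpretations in the usage lines is a hypothetical value, not one of the benchmark class sizes.

```python
from math import ceil, log


def sample_bound_realizable(class_size, epsilon, delta):
    """Upper bound on the sample complexity under the realizability assumption."""
    return ceil(log(class_size / delta) / epsilon)


def sample_bound_agnostic(class_size, epsilon, delta):
    """Upper bound on the sample complexity without the realizability assumption."""
    return ceil(2 * log(2 * class_size / delta) / epsilon ** 2)


# Hypothetical finite class of 1000 interpretations, epsilon = delta = 0.05.
m_realizable = sample_bound_realizable(1000, 0.05, 0.05)
m_agnostic = sample_bound_agnostic(1000, 0.05, 0.05)
```

Because the class size enters only logarithmically while $\epsilon$ enters as $1/\epsilon$ or $1/\epsilon^2$, tightening the error margin is much more expensive in samples than enlarging the class of interpretations.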

From the results above, if we use the $m_{\mathcal{E}}$ bound for the sample size, we get interpretations that are very close to the optimal interpretation within the class $\mathcal{E}$ with high probability. Of course, sans the realizability assumption, this does not necessarily mean the obtained interpretation is very close to the black-box model. The latter depends highly on the class of interpretations. Note also that the price for the PAC guarantee is that we may have to work with an increased size of the sample set $S$, as given by $m_{\mathcal{E}}$. In general, this affects the scalability of our synthesis procedure, since the size of the weighted MAXSAT formula increases linearly with $|S|$. This can limit how small $\delta$ and $\epsilon$ can be in practice. Nevertheless, as we show in Section V, we are able to use fairly small values of $\delta$ and $\epsilon$ in our experiments.

#### V. EVALUATION

*a) Benchmarks:* We apply our approach to three black-box models: a *decision module* for predicting the performance of a perception module in an airplane (AP), a *bank loan predictor* (BL), and a *solvability predictor* (TP).

The decision module predicts, based on the time of day, the cloud types, and initial positioning of an airplane on a runway, whether a perception module used by the plane can be trusted to behave correctly. The decision module is an implementation of a decision tree that was trained on data collected from 200 simulations, using the XPlane (x-plane.org) simulator.

The bank loan predictor is a deep neural network that was trained on synthetic data that we created. The training set included 100000 entries chosen such that the majority of people aged between 18 and 29 years, and those aged between 30 and 49 years but with income less than \$6000, were denied the loan. The network has five dense fully connected hidden layers with 200 ReLUs each, in addition to a softmax layer and an output layer comprising two nodes.

The solvability predictor is a neural network built to predict the solvability of first-order formulas by a theorem prover with respect to the percentage of unit clauses and the average clause length in a formula. The network had three dense fully connected hidden layers, each with 200 ReLUs. The data used to train the neural network can be found on the UCI machine learning repository [8]. We used the data for heuristic H1 from [8], thus predicting solvability for H1.

*b) Experiments and setup:* We conducted two types of experiments: (1) application of our exploration algorithm on the three benchmarks, and (2) performance evaluation of QUINTSYNT. As the MaxSAT engine, we used an implementation of RC2 in PySAT [16], [17]. All experiments were conducted on a 2.4GHz quad-core machine with 8GB of RAM. For additional details of the experiments and results, please see [31].

*c) Exploring the Pareto-optimal space:* We ran our approach on the three benchmarks mentioned above. We used confidence measure $\delta = 0.05$ and error margin $\epsilon = 0.05$ to determine the size of the sample set (as given in Table I) under the realizability assumption referred to in Section IV. Figures 3(a) to 3(c) show the measures of the Pareto-optimal interpretations found by our exploration algorithm. We used prediction accuracy for correctness (recall that this satisfies the technical assumption mentioned in Section IV), and an explainability measure that favored decision diagrams of smaller size with predicates having fewer branchings.

For all three benchmarks we found a variety of interpretations with interesting tradeoffs between the correctness and explainability measures, reflected by the blue squares in each plot. The exploration algorithm shows that searching for interpretations that are optimal only in size or in accuracy may result in unfavorable solutions. For example, in Figure 3(a) we see that the interpretation with highest accuracy has very low explainability. However, a very small tradeoff in accuracy resulted in significantly more explainable interpretations.

*d) Performance:* Table I presents our results on each benchmark and gives the confidence value $\delta$, error rate $\epsilon$ and the number of samples $|S|$ used for each run. The number of Pareto-optimal points (PO), total number of points explored (TNP) and minimum, maximum and median times to find a Pareto-optimal interpretation are also shown. The number shown in parentheses next to each benchmark is the number of predicates used. From Table I we can see that the number of Pareto-optimal (PO) points is considerably smaller than the total number of points explored (TNP). The minimum time taken to find an interpretation was less than 3 seconds for all benchmarks, but there were a few points in the Pareto-optimal space where finding an interpretation took considerably more time (see the maximum times). For most Pareto-optimal points though, the time taken to find an interpretation was less than 20 seconds, as demonstrated by the median values. If an interpretation did not exist for a combination of correctness and explainability measures, the MaxSAT solver returned UNSAT in less than a second in all performance runs.

TABLE I PERFORMANCE OF QUINTSYNT: EXPLORATION OF THE ENTIRE PARETO-OPTIMAL SPACE


As none of the other interpretation synthesis tools in the literature compute the set of all Pareto-optimal interpretations, we omit comparison with other tools (any such comparison would not be fair, especially when using different notions of explainability). However, to understand if the variation in running times is inherent to the problem, we performed a similar experiment with MinDS, a tool for learning decision sets [38]. In MinDS, correctness and explainability are combined in a single objective and the contribution of the explainability measure is governed by a parameter $\lambda$. We ran MinDS for 15 values of $\lambda$ and found interpretations for all these values. We observed again (Table II) that the time taken to find interpretations for some values of $\lambda$ was much higher than for others.

Note that unlike in our approach, running MinDS in this manner does not guarantee that the entire Pareto-optimal space of interpretations has been obtained. Finding all Pareto-optimal points by varying the weights of explainability and correctness measures is also not feasible, since this requires trying out all (infinitely many) weight combinations. While some decision sets learned by MinDS were indeed semantically equivalent to some of the Pareto-optimal interpretations synthesized by our approach, some interpretations that our method found did not have a decision set counterpart within the range of weights we experimented on. We emphasize that running approaches like MinDS that combine explainability and correctness measures into a single objective function may result in the same interpretation being returned for different combinations of weights. This can be avoided using our exploration method.

TABLE II ILLUSTRATING VARIATION IN RUNNING TIMES EVEN ON NON-EXHAUSTIVE PARETO SEARCH WITH MINDS


#### VI. RELATED WORK

There is a large body of work on interpreting black-box models, where a dominant paradigm is to generate labeled data samples and obtain an interpretable model representation in terms of input features, some of which were discussed in the introduction. In some applications, the aim is to explain the output of a black-box model in the neighbourhood of a specific input, and specialized techniques [12], [24], [29], [30], [39] give such local and robust explanations. Other applications use techniques like model distillation (in the form of decision trees [7], [9], [20], [22], [23]), counterfactual explanations [26], etc. For further information on these techniques, we refer the reader to the excellent surveys in [2], [13].

Fig. 3. Exploring Pareto-optimal solutions for three benchmarks. The size of the sample sets used for constructing interpretations was computed based on confidence value $\delta = 0.05$ and error margin $\epsilon = 0.05$, as well as the size of the class of interpretations in each benchmark.

The work in [15], [38] comes closest to ours. In [38], the authors encode the problem of finding an interpretation as optimal decision sets (to a weighted MAXSAT formulation). They present two variants: (i) optimize on accuracy (100%) while constraining the explainability (number of literals), and (ii) directly minimize the size of decision sets at the cost of accuracy. In [15], sparse optimal decision trees are built using an objective function that combines misclassification rate and number of leaves. Solution approaches like these give a single point of the optimized function in the Pareto-optimal space and hence a single value for the correctness and explainability measures.

Our Pareto-optimal interpretation synthesis problem formulation can also be related to Structural Risk Minimization (SRM), which is well-studied in the literature. Like in SRM, we have two orthogonal measures – one that depends only on the structure/complexity of the hypothesis/interpretation, and the other that depends on how well the hypothesis/interpretation "explains" the given sample set. The SRM formulation (e.g., see [32], Section 7.2) effectively combines these two measures into one and treats the problem as a single-objective optimization problem. In contrast, our Pareto-optimal synthesis problem is inherently a multi-objective optimization problem. As mentioned in the introduction, such a multi-objective optimization problem cannot be reduced to a single-objective optimization problem in general, without potentially excluding some (possibly important) solutions.
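The last point can be illustrated with a small, purely hypothetical example. The sketch below assumes four made-up candidates scored as (error, size) pairs, where lower is better on both axes; the names and numbers are not taken from our benchmarks. It compares the Pareto front against the candidates that a weighted-sum objective can ever return.

```python
# Hypothetical candidates scored as (error, size); lower is better on both axes.
candidates = {"A": (0, 10), "B": (10, 0), "C": (6, 6), "D": (7, 7)}

def dominates(p, q):
    """p dominates q if p is no worse on both measures and differs from q."""
    return p[0] <= q[0] and p[1] <= q[1] and p != q

def pareto_front(points):
    """Names of all candidates that no other candidate dominates."""
    return {name for name, p in points.items()
            if not any(dominates(q, p) for q in points.values())}

def weighted_sum_winners(points, steps=100):
    """Names minimizing w*error + (1-w)*size for some w on a grid over [0, 1]."""
    winners = set()
    for k in range(steps + 1):
        w = k / steps
        score = {name: w * p[0] + (1 - w) * p[1] for name, p in points.items()}
        best = min(score.values())
        winners |= {name for name, s in score.items() if s == best}
    return winners

front = pareto_front(candidates)
scalarized = weighted_sum_winners(candidates)
print(sorted(front), sorted(scalarized))
```

In this toy instance, candidate C is Pareto-optimal but is never the minimizer of the weighted sum: its score is 6 for every weight, while the better of A and B always scores at most 5.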

Finally, we note that the idea of using SAT (and related) solvers for systematically searching for all Pareto-optimal points has been used in other settings earlier (see, for example, systems biology applications in [4], [14]). However, their use in finding Pareto-optimal interpretations for black-box ML components appears not to have been explored earlier.

#### VII. CONCLUSION AND FUTURE WORK

We have presented a new approach to automatically generate a complete set of Pareto-optimal interpretations for black-box ML models, which works in the absence of training or test data sets. Our interpretations are obtained by instantiating user-provided decision diagram templates, and satisfy optimality conditions, while also providing formal guarantees on the tradeoff between accuracy and explainability. We have presented an empirical evaluation demonstrating that our approach produces compact, accurate explanatory interpretations for neural networks used for applications such as autonomous plane taxiing, predicting bank loans and classifying theorem provers. The discovery of multiple Pareto-optimal interpretations, as opposed to a single one, demonstrates the value of the multi-objective approach.

The current work focuses on finite classes of possible interpretations, although we allow a class to be combinatorially large. The weighted MAXSAT encoding allows us to solve this problem symbolically by leveraging significant recent advances in MaxSAT solving that scale to very large solution spaces. Using a finite, yet large hypothesis class permits us to strike a balance between generality and practical efficiency of our approach. An interesting avenue for future work would be to see if our approach can be extended to interpretation classes of infinite cardinality but finite Vapnik-Chervonenkis (VC) dimension. While the overall problem formulation, the notions of Pareto-optimality of explanations, and our algorithm for finding representative sets of explanations easily adapt to this setting, we would need to go beyond the current weighted MAXSAT formulation to find individual Pareto-optimal interpretations. Using an optimization modulo theories (OMT) encoding is a promising direction for such a generalization.

Acknowledgments. This work is partially supported by NSF grants 1545126 (VeHICaL), 1646208 and 1837132, by the DARPA contracts FA8750-18-C-0101 (AA) and FA8750-20-C-0156 (SDCPS), by Berkeley Deep Drive, and by Toyota under the iCyPhy center. We would also like to express our gratitude to the anonymous reviewers for their in-depth reviews, constructive suggestions and various pointers.

#### REFERENCES


# Dynamic Partial Order Reductions for Spinloops

Michalis Kokologiannakis *MPI-SWS* Kaiserslautern, Germany michalis@mpi-sws.org

Xiaowei Ren *The University of British Columbia* Vancouver, Canada xiaowei@ece.ubc.ca

Viktor Vafeiadis *MPI-SWS* Kaiserslautern, Germany viktor@mpi-sws.org

*Abstract*—Stateless model checking (SMC) coupled with dynamic partial order reduction (DPOR) is an effective way for automatically verifying safety properties of loop-free concurrent programs. SMC, however, does not work well for programs with loops because it cannot distinguish loop iterations that make progress from ones that revisit the same state. This results in redundant exploration that dominates the verification time.

We present SAVER (Spinloop-Aware Verifier), a memory-model-agnostic SMC/DPOR extension that detects *zero-net-effect spinloops* and avoids redundant explorations that lead to the same local state. As confirmed by our experiments, SAVER achieves an exponential reduction in verification time and outperforms state-of-the-art tools in a variety of real-world benchmarks.

*Index Terms*—stateless model checking, spinloops

### I. INTRODUCTION

*Stateless model checking* (SMC) [1] is a prominent technique for verifying safety properties of concurrent programs, especially under weak memory consistency [2]–[6]. The key design choice that makes SMC scale is that it does not record the set of states explored, but rather uses alternative techniques, namely *dynamic partial order reduction* (DPOR) [7], [8], to avoid exploring the same state multiple times. The downside of this choice, however, is that SMC struggles with spinloops, i.e., loops that continuously read a shared variable until some condition holds: as SMC does not record the set of visited program states, it cannot distinguish loop iterations that make progress from those that return to the same state. To make matters even worse, such loops are ubiquitous in real-world concurrent programs, whether lock-based or lock-free.

Consequently, spinloops typically have to be *bounded*. Since bounding generally sacrifices the soundness of the verification, one would like to use fairly large loop bounds to be confident enough that the program verified is correct. Doing so, however, is practically infeasible. A loop bound of N ≥ 2 typically leads to an exponential blowup in the state space, since the model checker explores the possibility of each spinloop failing 0, 1, ..., N − 1 times and, for each failure, all possible stores from which the spinloop load(s) can read.

To avoid the blowup, the solution is to use a bound of N = 1. So far, this is typically done manually by rewriting the program to use **assume** statements (a.k.a. **await**), special verifier commands that block the execution of the relevant thread when the condition of the **assume** is violated.

The goal of this paper is to determine *conditions* under which it is sound to do such conversions automatically. As we shall see, this turns out to be quite challenging.

First, spinloops cannot be adequately detected by a simple syntactic criterion. Since programming languages have many ways of creating spinloops (e.g., while loops, repeat-until loops, for-loops, goto statements), their detection is best done after converting each program thread into a *control-flow graph* (CFG). However, even there, simply removing the CFG backedges for side-effect-free loops (i.e., loops with no stores to global variables or to local variables that are live at the loop header) is insufficient, as illustrated by the program below. As a convention, in our examples, we use x, y, z for global (shared) variables and a, b, c, ... for registers.

$$\begin{array}{l||l} \mathbf{do} \ \ a := x & b := x \\ \mathbf{while} \ (a \neq 0) & \mathbf{while} \ (b \neq 0) \ \ b := x \end{array} \tag{\text{LOOP-PEEL}}$$

While the loop in thread I can be easily bounded by converting it into a := x; **assume**(a = 0), the one in thread II cannot because b is "live" at the header of the loop (its value is used in the loop).

Second, some spinloops may have side-effects, but these either do not occur on all their iterations or are never observed by the other threads (e.g., writing to a global variable that is not concurrently read) or cancel each other out (e.g., incrementing and then decrementing a variable, acquiring and releasing a lock). As an example of the latter kind, consider the following *zero-net-effect* (ZNE) spinloops extracted from a lock implementation.

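$$\begin{array}{l||l} \mathbf{while} \ (\mathit{true}) & \mathbf{while} \ (\mathit{true}) \\ \quad a := \mathsf{fetch\\_add}(x, 1) & \quad b := \mathsf{fetch\\_add}(x, 1) \\ \quad \mathbf{if} \ (a = 0) \ \mathbf{break} & \quad \mathbf{if} \ (b = 0) \ \mathbf{break} \\ \quad \mathsf{fetch\\_add}(x, -1) & \quad \mathsf{fetch\\_add}(x, -1) \\ \textit{// critical section} & \textit{// critical section} \\ \mathsf{fetch\\_add}(x, -1) & \mathsf{fetch\\_add}(x, -1) \end{array} \tag{\text{INC-DEC-SPIN}}$$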

Each thread tries to acquire the lock by incrementing x. If the lock was already taken, it decrements x and tries again. The lock is finally released by decrementing x. Since each decrement cancels out the previous increment, we would like to avoid considering loop iterations with a decrement, i.e., unsuccessful lock acquisition attempts. The soundness of doing so depends on the context. If, for instance, there is another thread repeatedly reading x, it may observe the value of x flickering, which cannot happen if we bound the ZNE loops to a single iteration. Similarly, if another thread writes to x concurrently, the loop may no longer have a zero net effect, rendering the transformation unsound.
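To make the zero-net-effect intuition concrete, the following minimal sketch models one unsuccessful acquisition attempt of INC-DEC-SPIN over a dictionary-based memory. It is a sequential toy model, not SAVER's implementation; the helper names `failed_attempt` and `observe` are assumptions introduced only for illustration.

```python
def fetch_add(mem, loc, delta):
    """Add delta to mem[loc] and return the old value (sequential toy model)."""
    old = mem[loc]
    mem[loc] = old + delta
    return old

def failed_attempt(mem, observe=None):
    """One unsuccessful iteration of the INC-DEC-SPIN loop (lock already held)."""
    a = fetch_add(mem, "x", 1)
    assert a != 0, "the lock must already be taken for the attempt to fail"
    if observe is not None:
        observe(mem["x"])   # a concurrent reader scheduled between the two updates
    fetch_add(mem, "x", -1)

mem = {"x": 1}              # hypothetical state: another thread holds the lock
failed_attempt(mem)
net_effect = mem["x"] - 1   # the decrement cancels the increment

seen = []
failed_attempt(mem, observe=seen.append)
print(net_effect, seen)
```

The net effect on x is zero, yet a reader scheduled between the increment and the decrement sees the intermediate value. This is the flickering that makes the soundness of the reduction context-dependent.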

To address these challenges, we develop SAVER (Spinloop-Aware Verifier), a model checker that reduces spinloops to a single iteration. SAVER works at the level of reduced control flow graphs, obtained by merging bisimilar nodes. Whenever a spinloop cannot be shown to be side-effect-free statically, SAVER dynamically checks that the reduced spinloop iterations have a zero net effect (in particular, that the context does not observe any of their effects), and if the check fails, it rolls back the transformation.

We remark that our results are independent of the *memory consistency model*: they hold not only for sequential consistency (SC), but also for weak memory models, which admit executions that cannot be expressed as program interleavings.

#### II. PRELIMINARIES

In this section, we review how programs can be represented as control flow graphs (§ II-A), how their executions can be modeled as execution graphs (§ II-B), and how DPOR enumerates these executions (§ II-C).

#### *A. Control Flow Graphs*

To avoid cluttering the presentation, we omit all features irrelevant to loops and concurrency. We represent a concurrent program *P* as a top-level parallel composition of threads, each of which is modeled as a control-flow graph (CFG). A CFG is a directed graph whose nodes are program labels and whose edges are labeled with instructions of the following form:

$$\begin{array}{rcl} \mathsf{Inst} \ni i & ::= & r := e \mid \mathbf{error} \mid \mathbf{assume}(e) \mid r := x \mid x := e \mid \\ & & r := \mathsf{fetch\\_add}(x, e) \mid r := \mathsf{CAS}(x, e\_1, e\_2) \end{array}$$

where r ranges over registers (i.e., local variables), x over global (shared) variables, and e over simple expressions built from integer constants n, registers, and arithmetic operators:

$$\mathsf{Exp} \ni e \; ::= n \mid r \mid e\_1 + e\_2 \mid e\_1 - e\_2 \mid \dots$$

Instructions comprise plain assignments; **error**, that halts the program (e.g., due to a safety violation); **assume**(e), that blocks the calling thread if e has the value zero; and memory accesses. Memory accesses include r := x, that reads the value of x and stores it in r; x := e, that stores the value contained in e in the global variable x; r := fetch\_add(x, e) (fetch-and-increment) that atomically increments the value of x by the value of e and returns the old value to r, and r := CAS(x, e1, e2) (compare-and-swap), that atomically compares the value stored in location x with the value of e1, and if they are equal, replaces the value of x with the value of e2. The r := CAS(x, e1, e2) instruction always returns the result of the comparison in r. We also use the term *load instruction* to refer to r := x, r := CAS(x, e1, e2), and r := fetch\_add(x, e) instructions, while we use *store instruction* to refer to x := e, r := CAS(x, e1, e2), and r := fetch\_add(x, e) instructions.
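The memory-access instructions described above can be sketched as follows. This is a minimal sequential model, assuming dictionaries for the shared memory and the registers; atomicity is trivial here because nothing runs concurrently, and the function names are illustrative rather than part of any tool.

```python
def load(mem, regs, r, x):
    """r := x"""
    regs[r] = mem[x]

def store(mem, x, value):
    """x := e, with e already evaluated to value"""
    mem[x] = value

def fetch_add(mem, regs, r, x, value):
    """r := fetch_add(x, e): add to x and return the old value in r."""
    regs[r] = mem[x]
    mem[x] += value

def cas(mem, regs, r, x, expected, new):
    """r := CAS(x, e1, e2): r always receives the result of the comparison."""
    success = mem[x] == expected
    if success:
        mem[x] = new
    regs[r] = int(success)

mem, regs = {"x": 0}, {}
fetch_add(mem, regs, "a", "x", 5)   # a receives the old value of x
cas(mem, regs, "b", "x", 5, 7)      # comparison succeeds, x is replaced
cas(mem, regs, "c", "x", 5, 9)      # comparison fails, x is unchanged
print(regs, mem)
```

Note that both `fetch_add` and `cas` read and (possibly) write x, which is why they count as both load and store instructions.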

Fig. 1. CFGs for the two threads of LOOP-PEEL.

We assume that input programs are deterministic in that each node n either has at most one successor (for standard program statements), or it has two successors labeled with **assume**(e) and **assume**(¬e) respectively (for conditionals and loops). As an example, Fig. 1 shows the CFGs for the two threads of the LOOP-PEEL program from §I. The loops generate cycles in the CFGs, and the conditional tests (whether to execute another loop iteration or to exit the loop) generate the edges labeled with **assume** statements.

A *path* π in a CFG is an alternating sequence of nodes and instructions corresponding to edges in the CFG, starting and ending with a node. That is, π is of the form n<sub>1</sub> i<sub>1</sub> n<sub>2</sub> i<sub>2</sub> n<sub>3</sub> ... n<sub>k−1</sub> i<sub>k−1</sub> n<sub>k</sub> where (n<sub>j</sub>, i<sub>j</sub>, n<sub>j+1</sub>) is an edge in the CFG for all 1 ≤ j < k. As is common in the literature, we are primarily interested in *simple paths*, which do not visit the same node twice, except possibly for their last node. A (simple) path is *cyclic* if it starts and ends with the same node, while a *lasso* path is one whose end node is one of its intermediate nodes. We write |π| to denote the length of the path (i.e., the number of edges it contains), and π(k) to project the k-th node and/or instruction of the path.

We say that node a *dominates* b if all paths from the entry node of the CFG to b contain a. Given a path π in a CFG, we say that a node h of π is its *header* if it dominates all nodes in π. By definition, paths can have at most one header; in the case of reducible graphs, every cyclic path has a header. For example, in Fig. 1, nodes 1 and 5 are the headers of the two cyclic paths, respectively.

A loopy path is a simple path that starts and ends at its header. Formally, a simple path π is called a *loopy path* of an edge n → h if π(1) = π(|π|) = h and π(|π| − 1) = n and h dominates all nodes in π (i.e., h is a header of π).
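The notions of dominance and loopy paths can be sketched in a few lines. The code below uses the textbook iterative dominator computation and a depth-first enumeration of simple paths. The edge list `THREAD_II` is our reading of the second CFG of Fig. 1 from the surrounding text; in particular, the exit node 7 is an assumption, and paths are represented as lists of edges rather than alternating sequences.

```python
def dominators(edges, entry):
    """Map each node to the set of nodes that dominate it (iterative fixed point)."""
    nodes = {entry} | {n for s, _, t in edges for n in (s, t)}
    preds = {n: {s for s, _, t in edges if t == n} for n in nodes}
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def loopy_paths(edges, entry, n, h):
    """Simple paths h ... n h (as edge lists) whose nodes are all dominated by h."""
    dom = dominators(edges, entry)
    back = [e for e in edges if e[0] == n and e[2] == h]
    result = []

    def extend(node, path, visited):
        if node == n:
            result.extend(path + [b] for b in back)
            return
        for e in edges:
            s, _, t = e
            if s == node and t not in visited and h in dom[t]:
                extend(t, path + [e], visited | {t})

    extend(h, [], {h})
    return result

# Thread II of LOOP-PEEL; node 7 (loop exit) is an assumed label.
THREAD_II = [
    (4, "b := x", 5),
    (5, "assume(b != 0)", 6),
    (6, "b := x", 5),
    (5, "assume(b == 0)", 7),
]
print(dominators(THREAD_II, 4)[6])
print(loopy_paths(THREAD_II, 4, 6, 5))
```

On this graph, node 5 dominates node 6, and the backedge 6 → 5 has a single loopy path that goes from 5 to 6 and back.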

#### *B. Execution Graphs*

In order to keep our approach as general as possible, we follow the standard axiomatic approach of Alglave *et al.* [9] and represent the executions of a concurrent program as *execution graphs*. Using execution graphs allows us to keep our formalism memory-model-agnostic, as our contributions do not depend on a particular memory consistency model.

Execution graphs have two basic components:


Fig. 2. MP: three consistent execution graphs under SC.

The semantics of a program *P* is given by the set of execution graphs that correspond to the instructions of the program and satisfy the consistency predicate of the underlying memory model. The purpose of the consistency predicate is to rule out executions with nonsensical edges, such as a load reading from a store later in program order or a store that has been overwritten by another store before the load.

To see how execution graphs model the executions of a program, consider the following example:

$$\begin{array}{l||l} x := 1 & a := y \\ y := 1 & b := x \end{array} \tag{\text{MP}}$$

Under SC, the MP program has three consistent executions, shown in Fig. 2, where the solid edges represent the program order and the green dashed edges the reads-from relation. As can be seen, execution 4 is inconsistent under SC—the consistency predicate of SC forbids the load of x to read from the initial state as the load is already aware of the x := 1 store. This execution, however, is allowed under certain weak memory models, such as the 'relaxed' fragment of RC11 [10].
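The count of three can be reproduced by brute force. The sketch below assumes the standard message-passing shape, in which the first thread executes x := 1; y := 1 and the second executes a := y; b := x, and enumerates all SC interleavings. Since each location has a single non-initial store, the pair of values read determines the reads-from relation.

```python
def interleavings(t1, t2):
    """All merges of two instruction sequences that preserve per-thread order."""
    if not t1 or not t2:
        yield list(t1) + list(t2)
        return
    for rest in interleavings(t1[1:], t2):
        yield [t1[0]] + rest
    for rest in interleavings(t1, t2[1:]):
        yield [t2[0]] + rest

def run(schedule):
    """Execute one interleaving sequentially and return the values of (a, b)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, loc, arg in schedule:
        if kind == "W":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["a"], regs["b"]

writer = [("W", "x", 1), ("W", "y", 1)]      # assumed first thread of MP
reader = [("R", "y", "a"), ("R", "x", "b")]
outcomes = {run(s) for s in interleavings(writer, reader)}
print(sorted(outcomes))
```

The six interleavings collapse into three outcomes. The outcome a = 1, b = 0 never arises, which matches the execution that SC forbids.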

Let us now formally describe events and execution graphs. For a more extensive discussion regarding execution graphs, we refer interested readers to Kokologiannakis *et al.* [5].

Definition 1. *An* event*,* e ∈ Event*, is either an initialization event* ⟨init l⟩ *for a location* l ∈ Loc *or a thread event* ⟨t, i, lab⟩ *where* t ∈ Tid *is a thread identifier,* i ∈ Idx ≜ ℕ *is a serial number inside each thread, and* lab ∈ Lab *is a label that takes one of the following forms:*


Definition 2. *An* execution graph G *consists of:*


Algorithm 1 Dynamic Partial Order Reduction

1: procedure VERIFY(*P*)
2:   ⟨G, Γ⟩ ← ⟨G<sub>∅</sub>, Γ<sub>∅</sub>⟩
3:   do
4:     VISITONE(*P*, G, Γ)
5:   while ⟨G, Γ⟩ ← pop(Γ)
6: procedure VISITONE(*P*, G, Γ)
7:   while consistent(G) ∧ a ← next<sub>*P*</sub>(G) do
8:     G.E ← G.E ⊎ {a}
9:     if a ∈ error then exit("error")
10:     else if a ∈ R then
11:       let {w<sub>0</sub>} ⊎ ws = G.E ∩ W<sub>loc(a)</sub>
12:       G ← SetRF(G, w<sub>0</sub>, a)
13:       Γ ← push(Γ, { SetRF(G, w, a) | w ∈ ws })
14:     else if a ∈ W then
15:       CALCREVISITS(G, Γ, a)
16:     CHECKZNEVALIDITY(G, a)

Our formal definition of execution graphs does not record the *program order* (po) as an explicit component because it can be defined directly from our representation of events:

$$\mathsf{po} \triangleq \left\{ \langle \langle \mathsf{init} \ l \rangle, \langle t, i, lab \rangle \rangle \; \middle| \; \forall l, t, i, lab \right\} \cup \left\{ \langle \langle t\_1, i\_1, lab\_1 \rangle, \langle t\_2, i\_2, lab\_2 \rangle \rangle \; \middle| \; t\_1 = t\_2 \land i\_1 < i\_2 \right\}$$

Initialization events precede all non-initialization events in po, while events in the same thread are ordered according to their serial numbers. Events from different threads are unordered.
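As a small illustration, the derived program order can be written as a predicate over pairs of events. The `Init` and `ThreadEvent` classes below are illustrative encodings of the two event forms of Definition 1, with labels simplified to strings; they are assumptions made for this sketch.

```python
from typing import NamedTuple, Union

class Init(NamedTuple):
    loc: str

class ThreadEvent(NamedTuple):
    tid: int
    idx: int
    lab: str   # labels are kept abstract here

Event = Union[Init, ThreadEvent]

def po(e1: Event, e2: Event) -> bool:
    """True iff e1 precedes e2 in program order."""
    if isinstance(e1, Init):
        # Initialization events precede every thread event.
        return isinstance(e2, ThreadEvent)
    # Same-thread events are ordered by their serial numbers.
    return (isinstance(e2, ThreadEvent)
            and e1.tid == e2.tid and e1.idx < e2.idx)

print(po(Init("x"), ThreadEvent(1, 0, "W x 1")))
print(po(ThreadEvent(1, 0, "W x 1"), ThreadEvent(2, 1, "R x")))
```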

#### *C. Dynamic Partial Order Reduction*

DPOR verifies a program by generating all of its consistent execution graphs and checking that none of them contains an error. To do so, DPOR typically assumes some basic properties of the consistency predicate, such as prefix-closedness and extensibility [5], which are satisfied by all known memory models that follow the graph representation of § II-B.

This graph representation is also very helpful for DPOR because it encodes the independence relation that is traditionally used by DPOR algorithms to decide which interleavings should be explored. Indeed, under sequential consistency, each graph corresponds to the set of thread interleavings that are equivalent under the *reads-from equivalence* [11], [12] (or under Mazurkiewicz equivalence if we extend the graphs to also record the coherence order).

Algorithm 1 shows the general structure of a DPOR algorithm. The procedure VERIFY verifies a concurrent program *P* by starting from the graph G<sub>∅</sub> containing only the initialization events and an empty *environment* Γ<sub>∅</sub> (Line 2), and exploring the executions of *P* one by one by calling VISITONE (Line 4). VISITONE does most of the exploration work: it explores one full execution of *P* and populates Γ with alternative exploration options. These exploration options recorded in Γ are later explored by VERIFY (Line 5).

At each step, VISITONE extends the current execution G by one event a (obtained via next<sub>*P*</sub>(G)), as long as G remains consistent according to the memory model (Line 7). If there are no more events to add, then G is complete, and VISITONE returns. If a denotes an error (e.g., an assertion violation), it is reported to the user and verification terminates (Line 9).

If a is a read, then it must read from some write in G. To this end, VISITONE calculates the set of all writes in G on the same location as a (Line 11), and chooses one write w<sub>0</sub> as the reads-from option for a (Line 12). For all other same-location writes, an alternative execution is added to Γ so that it can be explored later by VERIFY (Line 13).

If a is a write, it needs to revisit existing reads of the same location in G, because a was not present in the graph when VISITONE was considering possible reads-from options for these reads. To that end, VISITONE calls CALCREVISITS (Line 15), which extends Γ with such alternative explorations. Since the discussion on how these explorations are calculated is not relevant for this paper, we do not present it here; we refer interested readers to Kokologiannakis *et al.* [5], where CALCREVISITS is explained in detail.

Note that Algorithm 1 does not have any special treatment for **assume** statements. Whenever next<sub>*P*</sub>(G) encounters an **assume** statement whose condition is not satisfied, it returns a blocked event and stops scheduling that thread thereafter. When VERIFY later pops some graph that does not contain the blocked label (e.g., because the graph represents an alternative exploration choice before the blocked event), the thread will be again schedulable, and other options that might not block the **assume** will be considered.

#### III. BOUNDING EFFECT-FREE SPINLOOPS

Effect-free loop iterations that do not exit the loop are almost unobservable: they do not affect the set of reachable program states, and so can be ignored when verifying safety properties of a program. (We note that for liveness properties, effect-free loop iterations cannot be discarded so easily. An infinite sequence of such effect-free iterations, unless prevented by some fairness assumption about the program's semantics, yields a non-terminating run of the program.)

What remains to be clarified is what exactly constitutes an effect-free loop iteration. Clearly, the iteration should not be writing to a global variable, as otherwise other threads may be able to observe whether the iteration took place or not. Similarly, it should also not be assigning to any local registers that could affect the subsequent execution of the thread itself, i.e., to any variables that are *live* at the header of the loop. Assigning to a dead variable is harmless because, by definition, it does not affect the subsequent execution of the thread, even if technically it might reach a slightly different local state (differing only in the values of dead variables).

We note that spinloops need to be effect-free only along looping paths—they may well have side-effects on paths exiting the loop. This is frequently the case for CAS-loops, such as the following implementation of an atomic increment:

$$\begin{array}{l} \mathbf{do} \\ \quad a := x \\ \quad \mathit{success} := \mathsf{CAS}(x, a, a+1) \\ \mathbf{while} \ (\neg \mathit{success}) \end{array} \tag{\text{CAS-LOOP}}$$

Fig. 3. Simplified dequeue operation from the ms-queue benchmark and its CFG, whose instructions are abbreviated. In the code, head, next, and tail are global variables, while b, h, h′, n, and t are local registers.

Here, even though the loop contains a CAS, which is generally an effectful instruction, along the looping path, the CAS fails, and so the path is effect-free.

We also note that loops often have multiple looping paths, only some of which are effect-free. Consider, for instance, the **while** loop in Fig. 3, which is extracted from the ms-queue benchmark of §VIII. It contains three loopy paths. The first (through the **continue** statement) is trivially effect-free because it contains only loads and assignments to dead variables. (All local variables are dead at the loop header.) The second path (when h = t) can have side-effects, namely the CAS to tail. The third path (when h ≠ t) is again effect-free because whenever its CAS succeeds, the function returns.

Let us now make these intuitions more formal. A path π is *pure* if it either contains no store instructions or, if it contains any, all of them are failed CASes. That is, whenever π(i) is a store instruction, then it is of the form r := CAS(x, e1, e2) and there is i < j < |π| such that π(j) = **assume**(¬r) and for all i < k < j, π(k) does not assign to r.
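The purity condition can be sketched as a check over the instruction sequence of a path. The tuple encoding of instructions below, including the `"assume_not"` tag for **assume**(¬r), is an assumption made for this illustration and not the representation used by any particular tool.

```python
STORES = {"store", "fetch_add", "cas"}
ASSIGNS = {"assign", "load", "fetch_add", "cas"}   # second field is the target register

def assigned_register(instr):
    """Register written by instr, or None if it writes no register."""
    return instr[1] if instr[0] in ASSIGNS else None

def is_pure(path):
    """True iff every store on the path is a CAS that is later assumed to have failed."""
    for i, instr in enumerate(path):
        if instr[0] not in STORES:
            continue
        if instr[0] != "cas":
            return False
        r = instr[1]
        for later in path[i + 1:]:
            if later == ("assume_not", r):
                break                # the CAS is known to have failed on this path
            if assigned_register(later) == r:
                return False         # r is overwritten before the failure check
        else:
            return False             # no assume(not r) follows the CAS
    return True

# Looping and exiting paths of CAS-LOOP, in the hypothetical encoding above.
looping = [("load", "a", "x"), ("cas", "success", "x", "a", "a+1"),
           ("assume_not", "success")]
exiting = [("load", "a", "x"), ("cas", "success", "x", "a", "a+1"),
           ("assume", "success")]
print(is_pure(looping), is_pure(exiting))
```

Under this encoding, the looping path of CAS-LOOP is pure because its CAS is followed by the failure check, whereas the exiting path is not.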

Pure paths do not affect the global state, but can affect the local state. A loopy path does not affect the local state if it always reaches the same local state it started from. A simple approximation to reaching the same state is for the path to not assign to any variable that is live at its header. Putting these conditions together, an *effect-free spinloop* is a pure loopy path that does not assign to any variable live at its header. Formally:

Definition 3. *A CFG edge* n → h *is an* effect-free spinloop backedge *if every loopy path of* n → h *is pure and assigns only to registers dead at* h*.*

The *spin-assume transformation* removes all effect-free spinloop backedges from the CFG. Returning to the example in Fig. 1, the edge 2 → 1 is an effect-free spinloop backedge; removing it transforms thread I of LOOP-PEEL into a := x; **assume**(a = 0). In contrast, the backedge of thread II (6 → 5) is not effect-free and so the spin-assume transformation does not affect thread II.
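A minimal sketch of this transformation on the two threads of LOOP-PEEL is given below. It assumes that each instruction is encoded as a (kind, target register, used registers) triple, that loopy paths are supplied by the caller, and that nothing is live at the exit nodes 3 and 7 (both of which are assumed labels). For brevity, purity is approximated conservatively by treating every store, including every CAS, as effectful.

```python
def liveness(edges):
    """Registers live at each node: least fixed point of the backward equations."""
    nodes = {n for s, _, t in edges for n in (s, t)}
    live = {n: set() for n in nodes}
    changed = True
    while changed:
        changed = False
        for s, (kind, target, uses), t in edges:
            new = set(uses) | (live[t] - {target})
            if not new <= live[s]:
                live[s] |= new
                changed = True
    return live

EFFECTFUL = {"store", "fetch_add", "cas"}   # conservative: a CAS counts as a store

def is_effect_free_backedge(edges, loopy_paths, header):
    """Every loopy path has no stores and assigns only registers dead at header."""
    live = liveness(edges)
    for path in loopy_paths:
        for kind, target, _ in path:
            if kind in EFFECTFUL or target in live[header]:
                return False
    return True

def spin_assume(edges, backedge, loopy_paths):
    """Drop the backedge if it is an effect-free spinloop backedge."""
    _, _, header = backedge
    if is_effect_free_backedge(edges, loopy_paths, header):
        return [e for e in edges if e != backedge]
    return list(edges)

thread1 = [(1, ("load", "a", ()), 2),
           (2, ("assume a != 0", None, ("a",)), 1),
           (2, ("assume a == 0", None, ("a",)), 3)]
thread2 = [(4, ("load", "b", ()), 5),
           (5, ("assume b != 0", None, ("b",)), 6),
           (6, ("load", "b", ()), 5),
           (5, ("assume b == 0", None, ("b",)), 7)]

reduced1 = spin_assume(thread1, thread1[1], [[thread1[0][1], thread1[1][1]]])
reduced2 = spin_assume(thread2, thread2[2], [[thread2[1][1], thread2[2][1]]])
print(reduced1)
print(reduced2 == thread2)
```

Thread I loses its backedge and becomes the straight-line sequence a := x; **assume**(a = 0), while thread II is left untouched because b is live at node 5.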

#### IV. DETECTING MORE KINDS OF SPINLOOPS

While the spin-assume transformation defined in the previous section can detect typical cases of **do**-**while** spinloops, it does not apply to **while** loops that have a non-trivial condition.

The main problem is that the registers used to evaluate the condition are live at the loop header, and so any loop iterations that update these registers are deemed effectful. As a simple example, consider the spinloop of thread II of LOOP-PEEL from §I: register b is live at the beginning of the loop, and so the body of the loop (b := x) is effectful. (Formally, in the CFG of Fig. 1, register b is live at node 5—the loop header.)

One simple way to resolve this problem is to apply a compiler transformation called *loop rotation*, which moves the loop exit checks to the end of the loop. Applying loop rotation transforms the second thread of LOOP-PEEL as follows:

$$\begin{array}{lcl} b := x & & b := x \\ \mathbf{while} \ (b \neq 0) & \leadsto & \mathbf{if} \ (b \neq 0) \\ \quad b := x & & \quad \mathbf{do} \ b := x \ \mathbf{while} \ (b \neq 0) \end{array}$$

The transformed loop can be bounded with the spin-assume transformation yielding executions with at most two loads of x. We note that this bounding outcome is suboptimal, since thread I of LOOP-PEEL is bounded with a single load of x.

A better approach for this example is to exploit *bisimilarity* among CFG nodes. Two nodes are bisimilar if they produce the exact same computations, i.e., if their outgoing edges can be matched 1-to-1 in a way that every two matched edges are labeled with the same instruction and lead to bisimilar nodes. Bisimilarity can be computed as a greatest fixed point, starting with the identity relation (i.e., each node being bisimilar to itself) and adding pairs of nodes whenever they have matching outgoing edges to nodes already calculated to be bisimilar. For example, in Fig. 1, nodes 4 and 6 are bisimilar because they both have only one outgoing edge labeled with the same instruction (b := x) and leading to the same node (5).

Having detected that two (distinct) nodes a and b are bisimilar, we can then merge them into one node by redirecting b's incoming edges to a and deleting node b. For example, merging nodes 4 and 6 of Fig. 1 would add an edge from 5 to 4 with label **assume**(b ≠ 0), and remove node 6. Effectively, this transformation converts the second thread of LOOP-PEEL to a **do**-**while** loop analogous to that in its first thread, which makes the spin-assume transformation applicable.
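The detection and merging of bisimilar nodes can be sketched as follows. Instead of the pair-based computation described above, the sketch uses naive partition refinement, a different but standard way of obtaining bisimilarity classes on a finite graph. The edge list is again our reading of thread II in Fig. 1, with the exit node 7 as an assumption.

```python
def bisimulation_classes(edges):
    """Map each node to a class id; equal ids mean bisimilar (partition refinement)."""
    nodes = sorted({n for s, _, t in edges for n in (s, t)})
    block = {n: 0 for n in nodes}
    while True:
        # A node's signature: its current class plus (instruction, target class) pairs.
        signature = {
            n: (block[n], frozenset((i, block[t]) for s, i, t in edges if s == n))
            for n in nodes
        }
        ids = {}
        new_block = {n: ids.setdefault(signature[n], len(ids)) for n in nodes}
        if len(ids) == len(set(block.values())):
            return new_block          # no class was split: the partition is stable
        block = new_block

def merge(edges, keep, drop):
    """Redirect edges entering `drop` to `keep` and delete `drop` with its out-edges."""
    merged = []
    for s, i, t in edges:
        if s == drop:
            continue
        edge = (s, i, keep if t == drop else t)
        if edge not in merged:
            merged.append(edge)
    return merged

thread2 = [
    (4, "b := x", 5),
    (5, "assume(b != 0)", 6),
    (6, "b := x", 5),
    (5, "assume(b == 0)", 7),
]
classes = bisimulation_classes(thread2)
merged = merge(thread2, keep=4, drop=6)
print(classes[4] == classes[6], merged)
```

After merging, the loop consists of the edges 4 → 5 and 5 → 4, which has the same shape as the **do**-**while** loop of thread I.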

We note that merging bisimilar nodes is not always strictly better than loop rotation. There are cases where loop rotation (or a similar transformation called *jump threading*) can transform a loop into the **do**-**while** form, but no two distinct bisimilar nodes exist. Such cases frequently arise with **CAS** loops like the following.

$$\begin{array}{l} \mathit{success} := \mathit{false} \\ \mathbf{while} \ (\neg \mathit{success}) \\ \quad a := x \\ \quad \mathit{success} := \mathsf{CAS}(x, a, a+1) \end{array} \tag{\text{CAS-LOOP2}}$$

Here, the spin-assume transformation is not directly applicable to CAS-LOOP2 because *success* is live at the loop header and is updated by the loop body. Loop rotation and/or jump threading, followed by dead assignment elimination, convert this program to CAS-LOOP, which can be handled by the spin-assume transformation. By contrast, merging bisimilar nodes does not change the program, since the program does not contain the same instruction twice.

#### V. DYNAMICALLY CHECKING PURITY

The spin-assume transformation as described in §III uses a completely static definition of purity. If a CAS along a CFG path cannot be determined to always fail, the path is deemed effectful. This is, however, suboptimal for two reasons.

First, using a static purity definition prevents us from transforming paths that are pure only under certain contexts. For instance, consider the thread below, and assume that it is running as part of a program that only writes the value 0 to z (this might not be inferable statically):

In this case, the (only) loopy path of this thread will not be deemed pure (as the CAS is not followed by an **assume**(¬b) statement), even though it will never produce observable effects in its running context as a will always be 0.

Second, in cases where a loopy path contains a CAS that *does* have observable effects, it is wasteful to explore executions where such a CAS fails. To see this, consider again the dequeue operation of the ms-queue example in Fig. 3. As explained in §III, the second loopy path of this operation is not pure, as it potentially has side-effects. Still, it does not make sense to consider iterations where the CAS of this path fails, as they neither contribute to the loop exiting nor produce observable side-effects.

Leveraging the insights above, we say that a CFG backedge n → h is a *potentially effect-free spinloop backedge* if every loopy path of n → h assigns only to registers dead at h. The *dynamic-spin-assume transformation* marks all potentially effect-free spinloop backedges with a dynamic purity check. Whenever the next<sub>*P*</sub>(G) function of Algorithm 1 encounters such a check, it validates whether G contains any write event originating from the respective loop iteration and, if not, it returns a blocked event, thereby blocking the execution of the respective thread. Otherwise, if the loop iteration did generate a write event, next<sub>*P*</sub>(G) proceeds with the next event.
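The basic dynamic check can be sketched as a function over the events that a loop iteration has contributed to the graph. The `Event` record and the `BLOCKED`/`PROCEED` results are illustrative names, and the sketch assumes that a failed CAS contributes only a read event while a successful one also contributes a write event.

```python
from typing import List, NamedTuple

class Event(NamedTuple):
    kind: str   # "R" for reads, "W" for writes
    loc: str
    val: int

BLOCKED = "blocked"
PROCEED = "proceed"

def dynamic_purity_check(iteration_events: List[Event]) -> str:
    """Block the thread unless the finished iteration produced a write event."""
    wrote = any(e.kind == "W" for e in iteration_events)
    return PROCEED if wrote else BLOCKED

# Hypothetical iterations of a CAS loop on a global variable named tail.
failed_cas = [Event("R", "tail", 0), Event("R", "tail", 1)]
successful_cas = [Event("R", "tail", 0), Event("R", "tail", 0), Event("W", "tail", 1)]
print(dynamic_purity_check(failed_cas), dynamic_purity_check(successful_cas))
```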

In fact, the dynamic purity check described above can be relaxed even further: SAVER allows loop iterations to contain write events, as long as these only affect memory locations that are not reachable by other threads. In turn, this proves very useful in cases where some initialization writes need to take place as part of a loop.

Fig. 4. Simplified push operation from the treiber-stack benchmark with its CFG: stack is a global variable, while b, n, and s are registers.

To see an example of this, consider the push operation of the treiber-stack benchmark (cf. Fig. 4). First, a node to be inserted into the stack is created, but this node cannot be initialized fully: its next field needs to point to the existing top of the stack, but the stack top might change between the time it is read and the time the node is created. Thus, the push operation first reads the stack, sets it as the node's next, and then tries to atomically replace the stack with the newly created node. If the replacement succeeds, the operation exits; otherwise, it tries again. Notice, however, that, as long as the replacement CAS does not succeed, the store to the node's next remains unobserved by the other threads. Thus, it is safe to consider failed CAS loop iterations as effect-free, and block their exploration.

As a final remark, we observe that validating effect-free loops dynamically makes SAVER resilient to more aggressive loop rotation passes that convert loops to a canonical form containing a single backedge (see §VII).

#### VI. HANDLING ZERO-NET-EFFECT SPINLOOPS

Let us now consider the more challenging case of *zero-net-effect* (ZNE) loops. Recall that these are spinloop iterations that do have side-effects but (1) whose side-effects cancel each other out, and (2) whose intermediate effects are not observed by other threads. While condition (1) can be checked pretty well statically, condition (2) has to be checked dynamically. In the discussion below, we focus on ZNE loops that arise because of an atomic increment being followed by an atomic decrement of the same location and value.

A decrement instruction at node k is a *canceling decrement* in a loop h if all of h's loopy paths that contain node k also contain a prior opposite increment instruction, and the paths are effect-free modulo two instructions. More formally:

Definition 4. *A node* k *in a (minimal) CFG cycle with header* h *is a* canceling decrement *if it has a (unique) outgoing edge of the form* r<sub>1</sub> := fetch\_add(x, n)*, and for every loopy path* π *of* h *such that* π(i) = k *for some* 1 < i < |π|*, there exists* j < i *such that* π(j) = r<sub>2</sub> := fetch\_add(x, −n) *for some* r<sub>2</sub>*, and replacing the instructions at* π(i) *and* π(j) *with plain assignments to* r<sub>1</sub> *and* r<sub>2</sub> *yields an effect-free path.*
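The path condition of this definition can be approximated by the following Python sketch. It is only an illustration under simplifying assumptions: loopy paths are given explicitly as lists of a hypothetical `Instr` record, "effect-free" is reduced to "contains no other store or `fetch_add`", and the position constraint 1 < i < |π| is not checked.

```python
from typing import NamedTuple, Optional

class Instr(NamedTuple):
    node: int                  # CFG node the instruction's edge leaves from
    op: str                    # "fetch_add", "store", "load", or "assign"
    loc: Optional[str] = None  # shared location, if any
    amount: int = 0            # operand of fetch_add

SIDE_EFFECTS = {"fetch_add", "store"}

def is_canceling_decrement(k, paths):
    """Check the (simplified) path condition for node k over all loopy paths."""
    for path in paths:
        idx = next((i for i, ins in enumerate(path) if ins.node == k), None)
        if idx is None:
            continue                       # this path does not go through k
        dec = path[idx]
        if dec.op != "fetch_add":
            return False
        # Earlier fetch_add on the same location with the opposite amount.
        j = next((j for j in range(idx)
                  if path[j].op == "fetch_add"
                  and path[j].loc == dec.loc
                  and path[j].amount == -dec.amount), None)
        if j is None:
            return False
        # With the pair treated as plain assignments, nothing else may have an effect.
        rest = [ins for i, ins in enumerate(path) if i not in (idx, j)]
        if any(ins.op in SIDE_EFFECTS for ins in rest):
            return False
    return True

inc_dec = [Instr(1, "fetch_add", "x", 1), Instr(2, "assign"), Instr(3, "fetch_add", "x", -1)]
with_store = [Instr(1, "fetch_add", "x", 1), Instr(2, "store", "y"), Instr(3, "fetch_add", "x", -1)]
no_increment = [Instr(2, "assign"), Instr(3, "fetch_add", "x", -1)]
```

Here node 3 qualifies on the `inc_dec` path only; an extra store on the path or a missing matching increment rules it out.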

SAVER's *spin-zne* transformation annotates all canceling decrements so that when next*P*(G) encounters them for the first time (cf. Algorithm 1, Line 7), it generates a zne(x) event and blocks the thread instead of generating a read event and afterwards a write event. The zne(x) event serves as a marker for SAVER to validate that the transformation is sound.

Validation of ZNE loops happens every time a new event e is added to the graph by calling the CHECKZNEVALIDITY routine (Algorithm 1, Line 16). If we use the pair ⟨w, z⟩ to represent a blocked ZNE loop iteration with w being the event corresponding to the increment of the ZNE loop and z being the zne event, the addition of e can render the reduction of the ⟨w, z⟩ loop unsound in one of the following two ways.

Fig. 5. Execution graph encountered during the exploration of ZNE-OBS.

First, if e writes to the same location as w, it can be ordered (in coherence) between w and the blocked decrement (after z), and so, unless e is also an atomic increment, w and its corresponding decrement will no longer cancel each other out.

Second, if e reads from w and there is already some other read event reading from w, then, in an alternate execution, it is possible for e to read from the canceling decrement instead of w, thereby observing the value of the shared variable flickering. To see this, consider the example below.

$$\begin{array}{l||l}
\textbf{while}\ (\mathit{true}) & b := x \\
\quad a := \texttt{fetch\_add}(x, 1) & \textbf{if}\ (b) \\
\quad \textbf{if}\ (a = 42)\ \textbf{break} & \quad c := x \\
\quad \texttt{fetch\_add}(x, -1) & \quad \textbf{assert}(c)
\end{array} \qquad \text{(ZNE-OBS)}$$

Note that the loop of the first thread fulfills the conditions of a ZNE loop, and so the second fetch\_add() will be annotated by the spin-zne transformation.

Figure 5 shows the execution graph arising from adding the events of thread I and then adding the read event corresponding to the b := x instruction of thread II in the case it reads the incremented value of x. Next, we have to add the event corresponding to c := x. In this graph, the only consistent option for this event is to also read the incremented value of x, which satisfies the subsequent assertion. Yet, if we had the decrement of x instead of the zne event in the graph, c could also have read the value 0 from the decrement, and the **assert** would have failed. Thus, it is clear that concurrent reads can render the transformation of ZNE spinloops unsound.
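The scenario can be replayed with a small Python sketch that enumerates interleavings of one loop iteration of thread I with thread II. It assumes a plain sequentially consistent interleaving semantics (much simpler than the memory models SAVER targets) and models thread I only by the deltas it applies to x; the function names are illustrative.

```python
from itertools import combinations

def interleavings(n_a, n_b):
    """Yield all merges of n_a steps of thread I with n_b steps of thread II."""
    total = n_a + n_b
    for pos in combinations(range(total), n_a):
        yield ["I" if i in pos else "II" for i in range(total)]

def assertion_can_fail(thread1_deltas):
    """thread1_deltas: updates thread I applies to x, in program order."""
    for schedule in interleavings(len(thread1_deltas), 2):
        x, pending = 0, list(thread1_deltas)
        b = c = None
        for who in schedule:
            if who == "I":
                x += pending.pop(0)
            elif b is None:
                b = x            # b := x
            else:
                c = x            # c := x (only matters when b is non-zero)
        if b and c == 0:
            return True          # assert(c) would fail in this schedule
    return False

with_decrement = assertion_can_fail([+1, -1])   # full iteration: increment, then decrement
blocked_after_inc = assertion_can_fail([+1])    # decrement replaced by blocking the thread
```

With the decrement present, the schedule "increment, b := x, decrement, c := x" violates the assertion; once the decrement is dropped no schedule does, which is exactly the behavior the validity check has to guard against.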

Therefore, CHECKZNEVALIDITY(G, e) (cf. Algorithm 2) checks whether either of these two conditions holds for any existing zne(x) event in the graph (where x is the location accessed by e), and if so, it removes the zne event(s) and unblocks the corresponding thread(s), which will eventually add the missing decrement event(s) and restore soundness.
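A rough Python sketch of such a check, written only from the prose description above, is given below. The `Ev` and `Graph` records and all field names are hypothetical, and the sketch assumes that e has already been added to the graph when the check runs.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)
class Ev:
    kind: str                    # "R", "W", or "ZNE"
    loc: str
    thread: int
    is_fetch_add: bool = False   # write half of an atomic increment/decrement
    rf: Optional["Ev"] = None    # for reads: the write being read from

@dataclass
class Graph:
    events: list = field(default_factory=list)
    zne_pairs: list = field(default_factory=list)   # (w, z) pairs of blocked iterations
    blocked: set = field(default_factory=set)       # ids of blocked threads

def check_zne_validity(g, e):
    """Drop blocked ZNE iterations that the newly added event e makes unsound."""
    for (w, z) in list(g.zne_pairs):
        if w.loc != e.loc:
            continue
        # Case 1: a write to the same location that is not itself an atomic increment.
        conflicting_write = e.kind == "W" and not e.is_fetch_add
        # Case 2: e reads from w while another read already reads from w.
        other_readers = [r for r in g.events
                         if r.kind == "R" and r.rf is w and r is not e]
        second_read = e.kind == "R" and e.rf is w and bool(other_readers)
        if conflicting_write or second_read:
            g.zne_pairs.remove((w, z))
            g.events.remove(z)
            g.blocked.discard(z.thread)   # the thread resumes and adds the decrement
```

In the ZNE-OBS example, the first read of the incremented value leaves the zne event in place, while the second read from the same write removes it and unblocks thread I.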

Other cases of ZNE loops can be handled in a similar manner. For example, consider spinloops containing matching lock acquisitions and releases. In such a case, acquiring the lock acts as the increment operation and releasing the lock as the matching decrement. Statically, it therefore suffices to check that each lock release in the spinloop has its corresponding lock acquisition earlier in the same spinloop iteration.

#### Algorithm 2 ZNE Spinloop Validity Check


Dynamically, we simply check that no other thread accesses the lock other than by calling the acquire and release methods.

#### VII. IMPLEMENTATION

We implemented SAVER as an extension to the open-source GENMC tool [5], [13]. GENMC is a state-of-the-art stateless model checker for C/C++ programs that works at the level of LLVM Intermediate Representation (LLVM-IR), and can verify programs under weak memory models such as RC11 [10] and IMM [14]. SAVER is implemented as (a) a collection of transformation passes that modify GENMC's input before the latter starts the verification procedure, and (b) slight modifications to GENMC's DPOR algorithm that handle the dynamic checks for pure and ZNE loops.

As expected, SAVER imposes negligible overhead over GENMC, as its transformations take place statically, before the verification procedure starts, and the dynamic conditions for purity and ZNE loops can be checked in O(n) time (where n is the size of the graph), which is dominated by GENMC's existing consistency checks.

We conclude this section with some remarks regarding the implementation of loop rotation and the merging of bisimilar nodes over GENMC/LLVM.

In the case of loop rotation, we have implemented our own custom loop rotation pass that applies to loops whose rotation is deemed worthwhile. Although LLVM already contains an implementation of loop rotation, that implementation performs a more aggressive transformation by converting loops to a canonical form containing a single backedge. That is, if the loop contains multiple backedges, it constructs a new node with a backedge to the loop header and redirects all the existing backedges to the new node. This latter transformation is detrimental to the static detection of effect-free paths because it would, for example, conflate the three loopy paths of ms-queue's dequeue operation (Fig. 3), thereby disabling the spin-assume transformation for the two that are effect-free. To avoid this unintended consequence, one would then have to undo this transformation (e.g., by invoking a form of jump threading) or rely on dynamic purity checks (§V). Instead, and to be able to statically transform as many loops as possible, we opted for implementing our own loop rotation pass, which transforms simple loops like CAS-LOOP2; loops that are not captured by our loop rotation pass are handled dynamically.

In the case of merging of bisimilar nodes, there are also a couple of points worth mentioning. First, detecting bisimilar nodes on LLVM is more complicated than what was discussed in §IV because LLVM represents programs in *static single assignment* (SSA) form. The effect of this design choice is that there are never two nodes with identical assignments on their outgoing edges, since by the SSA definition each assignment is to a different register. Therefore, the standard bisimilarity algorithm outlined earlier in this section will not detect any nodes as being bisimilar!

As an example, consider the "SSA-CFG" of thread II of the LOOP-PEEL program from §I, which is shown below.

The SSA-CFG is an enriched kind of CFG whose nodes may have ϕ-guards that define a variable differently depending on the incoming control flow path. For instance, in the SSA-CFG above, at node 2, b<sub>1</sub> is defined to be equal to b<sub>0</sub> if node 2 is reached from node 1, or to b<sub>2</sub> if it is reached from node 4.

In order to match nodes 1 and 4, our bisimilarity implementation has not only to account for ϕ-nodes, but also unify the variables b<sub>0</sub> and b<sub>2</sub>. It does so by collecting equality constraints and solving them by unification. For each node with more than one incoming edge, the algorithm starts iterating backwards for each pair of predecessors, and collects the constraints under which these predecessors are equal, simplifying them along the way. The iteration stops when some nodes cannot be equal under any constraints, or the entry node has been reached. At that point, any pair of nodes whose constraints can be trivially solved (namely, nodes 1 and 4 above) are deemed bisimilar.

Besides making bisimilarity detection more complex, SSA also affects the merging of bisimilar nodes. Consider the program below along with its SSA-CFG.

$$\begin{array}{l||l}
a := 0 & a_0 := 0 \\
b := x & b_0 := x \\
\textbf{while}\ (\ldots) & a_1 := \phi(a_0/2,\ a_2/4) \\
\quad a := a + 1 & b_1 := \phi(b_0/2,\ b_2/4) \\
\quad b := x & b_2 := x
\end{array}$$

As can be seen, each of the assignments is to a different register, and node 3 contains two ϕ-guards (one for a and one for b) selecting the appropriate register to use depending on the incoming branch. With the algorithm outlined above one can detect that nodes 2 and 4 are bisimilar. However, one cannot simply add an edge a<sub>2</sub> := a<sub>1</sub> + 1 from node 3 to node 2 because that would violate the SSA form. To ensure that the resulting CFG is well-formed we also have to introduce a ϕ-guard at node 2 to say which version of a should be used for node 2. Our implementation achieves this by *moving* ϕ-guards the incoming values of which have not been deemed bisimilar (e.g., the ϕ-guard for a here) to the new loop header, along with any other incoming edges these ϕ-guards have.

#### VIII. EVALUATION

In this section, we evaluate the effectiveness of SAVER's optimizations on a variety of benchmarks. Our evaluation comprises two distinct parts, with the first part concerning the overall performance of SAVER in a real-world setting, and the second part evaluating the effectiveness of employing individual transformations.

In general, we observe that applying the transformations introduced in this paper typically leads to *exponential gains* in real-world benchmarks with spinloops. Key to these gains are SAVER's dynamic checks for spinloop purity and/or validity of ZNE spinloops, as well as the bisimilarity-based reduction of CFGs, which enables more spinloops to be bounded.

We conducted all experiments on a system with an Intel(R) Core(TM) i5-6600 CPU (4 cores @ 3.30GHz) and 16GB of RAM, running a custom Debian-based distribution. We used LLVM 7 for GENMC (v0.5.3). All reported times are in seconds. We set the timeout limit to 30 minutes.

#### *A. Overall Performance*

We start by evaluating SAVER on some challenging data structures utilizing weak-memory atomics that we harvested from the literature, including all data-structure benchmarks from GENMC's original paper [5]. Since we want to measure the effectiveness of SAVER's optimizations over the existing GENMC implementation, we do not compare against other tools and use GENMC as a baseline for our comparison. Since GENMC already contains a simple heuristic that converts some very simple **do**-**while** spinloops into **assume** statements, we use two versions of GENMC: one with its heuristic disabled and one with it enabled.

As can be seen in Table I, these benchmarks demonstrate that SAVER is extremely effective in a real-world setting, and that SAVER's extensions combined lead to exponential gains. For all these benchmarks apart from mutex-musl, we have used an unroll value of N + 1 (where N is the number of threads, shown in parentheses) for both GENMC and SAVER to avoid manually unrolling any loops that spawn threads or initialize thread-local variables. For mutex-musl an unroll value of 2 and some manual unrolling was used, to keep the state space manageable. The transformations that SAVER applies are shown in the rightmost column, where S, D, Z, L, and B stand for spin-assume, dynamic-spin-assume, zne-assume, loop-rotation, and bisimilarity, respectively.

As can also be seen, GENMC's simple heuristic is of rather limited value. It works very well only for the first two benchmarks (mcslock and qspinlock), where it matches the performance of SAVER. For the next three benchmarks (seqlock, mpmc-queue, and linuxrwlocks), it reduces the number of executions explored, but is still much slower than SAVER. Specifically, for mpmc-queue(4) and linuxrwlocks(4) GENMC does not manage to terminate within the time limit, while for seqlock(4) it needs 30.71 seconds. For the remaining eight benchmarks, GENMC's heuristic does not apply at all.

SAVER, on the other hand, is able to employ its transformations (even if only partially) on all the benchmarks and, with the exception of mutex-musl, this leads to a huge reduction in verification time over GENMC. That is, even if in some cases, SAVER only applies spin-assume/zne-assume in some of the data-structure's methods, or even in some paths of a particular method, SAVER is still orders of magnitude faster than GENMC. Concretely, for all benchmarks, SAVER is able to transform at least one of the spinloops completely into an **assume** statement. For seqlock, SAVER reduces the read paths; for mpmc-queue, it reduces both the enqueue and dequeue methods; for linuxrwlocks, the read lock and write lock methods; for chase-lev, the steal method; for treiber-stack, the pop method; for mutex, mutex-musl, ttaslock, and twalock, various spinloops in the lock and unlock paths; for ms-queue, the enqueue and dequeue methods; and for scgather the check method. Finally, the smaller gains in verification time for mutex-musl are due to the small unroll value used and the fact that SAVER's transformations do not apply to all the benchmark loops.

TABLE I REAL-WORLD BENCHMARKS

### *B. Employing Dynamic Purity/Unobservability Checks*

As can be seen from Table I, in more than half of the benchmarks, SAVER checked the purity of a spinloop or the non-observability of its intermediate effects dynamically. Dynamic checking proves useful in three cases.

First, in cases like ms-queue, plain spin-assume is not enough to fully transform some spinloop iterations into **assume** statements because they contain possibly succeeding CAS operations. Recall from Fig. 3 that the second loopy path of the simplified dequeue implementation is not effect-free. By adding a dynamic check to the relevant backedge, SAVER only considers iterations where the CAS actually succeeds, thus greatly reducing the state space of the program.

TABLE II BENEFITS OF BISIMILARITY

Second, in other cases (e.g., mutex and ttaslock), dynamic-spin-assume is necessary as spinloops contain function calls possibly containing side-effects. As it is difficult to determine statically whether these side-effects will actually take place in the particular calling context, the check is deferred to runtime.

Third, the unobservability checks both for initialization writes in failed CAS loops (e.g., treiber-stack) and for ZNE loops (linuxrwlocks and scgather) are very hard to perform statically with sufficient precision. As such, performing them dynamically is the only viable option.

#### *C. Employing Loop Rotation and Bisimilarity Reduction*

Loop rotation and bisimilarity reduction are similarly important in some real-world test cases. Even though they do not yield any performance improvements on their own, they are instrumental in making the spin-assume and zne-assume transformations applicable to more complex cases. Specifically, in benchmarks like ms-queue and linuxrwlocks, spin-assume and zne-assume are not applicable without loop rotation and bisimilarity respectively. And, in fact, these are not the only cases that we have encountered; there are many ways to rewrite the same benchmarks so that they also require bisimilarity and/or loop rotation, thus rendering these transformations a necessity, as opposed to an enhancement.

As a further demonstration of their usefulness, we consider two synthetic test cases inspired by the LOOP-PEEL example. In these tests, some threads repeatedly write to a shared variable, which is read by readers that employ schemes similar to LOOP-PEEL's second thread. As explained in §III, spin-assume is not directly applicable in such cases because the live variables of the header are redefined within the loop. Thus, we used an unroll value of 3, and manually unrolled any loops utilized by the writer threads. For these benchmarks, we used three SAVER versions: the default version that employs both bisimilarity and loop rotation (SAVER), a version where bisimilarity is disabled (SAVER∖B) and a version where both bisimilarity and loop rotation are disabled (SAVER∖B∖L). The results can be seen in Table II.

With bisimilarity reduction, SAVER transforms the spinloops into **assume** statements and only explores one execution, since only one combination of values satisfies the **assume**s. Applying only loop rotation is equivalent to transforming the syntactic spinloops in these programs into **assume** statements but keeping the peeled iteration. Thus, SAVER∖B explores a much larger number of executions, which affects the verification time. Applying neither transformation (SAVER∖B∖L) explores a huge number of executions and often times out. These results highlight the necessity of being resilient against small syntactic variations as, even if a single read is not taken into account when transforming a spinloop into an **assume**, the state space might grow exponentially.

#### IX. RELATED WORK AND CONCLUSIONS

We have presented a set of automated techniques for soundly bounding various kinds of spinloops to a single iteration, which empowers SMC to reason effectively about programs containing such spinloops. Although our contribution was presented in terms of SMC, it should be equally applicable to SAT/SMT-based bounded model checking (BMC) implemented by different tools (e.g., [15]–[17]).

Although there is a large body of work on model checking concurrent programs (e.g., [12], [18]–[22]), we are not aware of any other automated technique for bounding such a wide range of spinloops including potentially effect-free and ZNE loops. NIDHUGG [3], [23], RCMC [4] and GENMC [5], [13] are the only other tools we are aware of that automatically transform some spinloops to **assume** statements but they limit themselves to very simple busy-wait loops with no side-effects and no CAS instructions and they are not resilient to simple syntactic variations of such loops. POET [24] does recognize spinloop iterations that do not make progress, but saves the program state in order to do so.

Since both SMC and BMC cannot handle programs with executions of unbounded length, most tools bound the number of allowed loop iterations by a user-specified bound. Other tools like CDSCHECKER [2] use a memory-liveness bound to ensure termination for spinloops. As shown in §VIII, bounding techniques in general are inferior to converting spinloops to **assume** statements in terms of scalability.

Bounding of spinloops to a single iteration is, however, not a totally new idea. In a rather different context, Flanagan *et al.* [25] have used purity for proving atomicity of concurrent libraries treating effect-free spinloops as though they had been reduced to **assume** statements. Elmas *et al.* [26] have also performed similar transformations in their tool QED, which allows a programmer to initiate a sequence of reductions and abstractions to statically establish correctness of a program.

#### ACKNOWLEDGMENTS

This work was supported by a European Research Council (ERC) Consolidator Grant for the project "PERSIST" under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 101003349).

#### REFERENCES


*ACM Program. Lang.*, vol. 3, no. POPL, 69:1–69:31, Jan. 2019, ISSN: 2475-1421. DOI: 10.1145/3290382.


# Robustness between Weak Memory Models

Soham Chakraborty EEMCS, TU Delft Email: s.s.chakraborty@tudelft.nl

#### *Abstract*—

Robustness of a concurrent program ensures that its behaviors on a weak concurrency model are indistinguishable from those on a stronger model. Enforcing robustness is particularly useful when porting or migrating applications between architectures. Existing tools mostly focus on ensuring sequential consistency (SC) robustness which is a stronger condition and may result in unnecessary fences.

To address this gap, we analyze and enforce robustness between weak memory models, more specifically for two mainstream architectures: x86 and ARM (versions 7 and 8). We identify robustness conditions and develop analysis techniques that facilitate porting an application between these architectures. To the best of our knowledge, this is the first approach that addresses robustness between the hardware weak memory models.

We implement our robustness checking and enforcement procedure as a compiler pass in LLVM and experiment on a number of standard concurrent benchmarks. In almost all cases, our procedure terminates instantaneously and inserts significantly fewer fences than the naive schemes that enforce SC-robustness.

#### I. INTRODUCTION

Robustness analysis checks whether a program running on a weak memory consistency model demonstrates only the behaviors that are allowed by a stronger model. Robust programs can therefore be seamlessly migrated from one model to another as far as their concurrent behaviors are concerned. If a program is not robust, we can insert fences to enforce robustness.

Robustness analysis is especially beneficial in porting applications [1, 2] where it is crucial to preserve the observable behaviors of a running application. For instance, consider the porting of an application written for x86 to ARM. Since the x86 model is stronger than the ARM models (x86 exhibits fewer behaviors), x86-robustness abstracts the underlying ARM machine specification to an outside observer. Consider the following programs where initially X = Y = 0.

$$\begin{array}{l||l}
X = 1; & Y = 1; \\
a = Y; & b = X;
\end{array}\ \ \text{(SB)}
\qquad\qquad
\begin{array}{l||l}
a = X; & b = Y; \\
Y = 1; & X = 1;
\end{array}\ \ \text{(LB)}$$

Both x86 and ARM allow the same set of concurrent executions for the SB program, and hence its behaviors are indistinguishable on x86 and ARM. Therefore SB can be ported seamlessly between these architectures. Now consider the porting of the LB program from x86 to ARM. x86 disallows a = b = 1 but ARM allows the outcome. Hence the LB program in ARM is not x86-robust. To enforce x86-robustness we insert fences in both threads and restrict the a = b = 1 outcome.

Checking and enforcing robustness to a stronger but non-SC model from a weaker model can play a key role in migrating programs between architectures having weak concurrency models. Existing SC-robustness approaches may not provide an optimal solution as they check a stronger constraint and hence may introduce additional fences. For example, if we use an SC-robustness checker for SB, it identifies that the a = b = 0 outcome is allowed on ARM but disallowed in SC. Hence the analyzer inserts two full fences (DMB in ARMv7 and DMBFULL in ARMv8) between the memory accesses in both threads which are unnecessary in this case.
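As a sanity check on the SC side of this argument, the following Python sketch enumerates all sequentially consistent interleavings of the two litmus tests and collects the final (a, b) values. It models SC only (not x86 or ARM), and it assumes the usual store-buffering and load-buffering shapes, with the first SB thread reading Y into a; the encoding of steps as tuples is purely illustrative.

```python
from itertools import permutations

def sc_outcomes(threads):
    """Return the set of final (a, b) values over all SC interleavings.

    Each thread is a list of ("write", loc, value) or ("read", reg, loc) steps.
    """
    tagged = [(t, i) for t, steps in enumerate(threads) for i in range(len(steps))]
    outcomes = set()
    for order in permutations(tagged):
        # Keep only orders that respect program order within each thread.
        if any(order.index((t, i)) > order.index((t, i + 1))
               for t, steps in enumerate(threads)
               for i in range(len(steps) - 1)):
            continue
        mem, regs = {"X": 0, "Y": 0}, {}
        for t, i in order:
            step = threads[t][i]
            if step[0] == "write":
                mem[step[1]] = step[2]
            else:
                regs[step[1]] = mem[step[2]]
        outcomes.add((regs["a"], regs["b"]))
    return outcomes

SB = [[("write", "X", 1), ("read", "a", "Y")],
      [("write", "Y", 1), ("read", "b", "X")]]
LB = [[("read", "a", "X"), ("write", "Y", 1)],
      [("read", "b", "Y"), ("write", "X", 1)]]
```

Under these assumptions `sc_outcomes(SB)` never contains (0, 0) and `sc_outcomes(LB)` never contains (1, 1), which is why an SC-robustness checker flags the a = b = 0 outcome of SB even though both x86 and ARM permit it.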

To address this scenario we propose robustness analysis and enforcement between weak memory models of two mainstream architectures: x86 and ARM (version 8 and 7). As ARMv8 is a stronger model than ARMv7, we also study ARMv8-robustness for ARMv7 to enable application porting between these ARM models. We also check SC-robustness in x86, ARMv8, ARMv7 and restrict relaxed memory behaviors.

In this paper we propose M-K robustness where M is a stronger model than K and M can also be a non-SC model unlike existing approaches in [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. We propose the M-K robustness conditions in §III and prove their correctness [15]. Our proposed M-K robustness conditions ensure that if a K-consistent execution satisfies the M-K condition then the execution is also M-consistent. We check if certain memory access pairs are appropriately ordered in a K-consistent execution so that the execution shows no weaker behavior. Otherwise we insert fences to enforce order and restrict the weaker behaviors. However, as fences are costly, we investigate if it is possible to weaken the robustness constraints for the memory access pairs which are on same-location or are ordered by dependencies. We observe that these relations suffice in x86 and ARMv8, but the results in ARMv7 are counter-intuitive.

• We note that dependency based ordering *preserved-program-order* (ppo) is not strong enough to ensure robustness in ARMv7. Consider the following ARMv7 program.

$$\begin{aligned} \left\| \begin{aligned} a = T; \\ X = a; \end{aligned} \right\| \left\| X = 2; \\ \left\| Y = b; \right\| \left\| Z = c; \right\| \end{aligned} \right\| \left\| Z = 1; \right\| \left\| \begin{aligned} d = Z; \\ T = d; \end{aligned} \left( \mathsf{WP} \right) \right\| $$

The execution in Fig. 4 exhibits non-SC behavior though all the memory access pairs result in ppo relations due to data dependencies. Even an intermediate full fence in one of these threads cannot restrict the relaxed behavior.

• We evaluate the role of same-location program-order relation in defining robustness conditions. On ARMv7, same-location read-write access pair is unordered (see ARM-Weak [16] example in Fig. 2). Yet if all *external-program-orders* (see §III) are on same-location or have intermediate fences then the program exhibits only SC behavior.

In §IV we propose static analyses to check if a program is M-K robust based on the respective conditions. Otherwise we insert fences to enforce robustness. These analyses are computed in polynomial time as shown in §IV-C unlike the robustness checkers which explore program executions and are of significantly higher computational complexity.

The robustness checking procedures analyze the programs with thread functions. In these programs each thread function may result in any number of concurrent threads in an execution. Thus our analysis is parameterized by the thread functions and the analyses are applicable to all the programs having same thread functions.

We have implemented the analysis procedures in a tool called *Fency* based on LLVM [17] and have evaluated it on several well-known concurrent programs [8, 14]. We compare the SC-x86 robustness analysis of *Fency* to existing SC-TSO robustness results of Trencher [8] that explore program executions by model checkers. Yet, *Fency* is quite precise and matches Trencher in most of the programs. Moreover, *Fency* does not use external model checkers or SAT/SMT solvers and therefore is significantly faster in most of the cases.

We also compare *Fency* to a *naive* fence insertion scheme that does not use robustness analysis. *Fency* inserts significantly fewer fences than the naive scheme in several benchmarks. Moreover, empirical evaluations show that if a model W is weaker than M then ensuring W-K robustness often requires fewer fences than ensuring M-K robustness. Thus precise robustness analysis is indeed beneficial for many cases instead of using SC-robustness checkers.

Outline and Contributions. §II reviews the concurrency models. §III proposes the M-K robustness conditions. §IV explains our approach to check and enforce robustness. §V examines the experimental results. §VI discusses the related work and we conclude in §VII. The proofs and additional details are in the supplementary material [15].

#### II. CONCURRENCY MODELS

In this section we review SC, x86, ARMv8, and ARMv7 concurrency. For all models we follow a common syntax.

$$\begin{aligned} E &:= r \mid v \mid E + E \mid E * E \mid E \le E \mid \cdots\\ C &:= \mathsf{skip} \mid C; C \mid t = E \mid t = X \mid X = E \mid \mathsf{RMW}(X, E, E)\\ &\mid \mathsf{Fence} \mid \mathsf{RMW}(X, E) \mid \mathsf{br} \mid \mathsf{label} \mid \mathsf{br} \mid \mathsf{label} \mid \cdots\\ P &:= X = v; \cdots; X = v; \{C \parallel \cdots \parallel C\} \end{aligned}$$

An expression E results from thread-local temporary (t), value (v), and arithmetic operations (E). Command t = X returns the value of a shared memory location X to a thread-local register r and X = E writes the evaluation of expression E to X. The RMW(X, E<sub>r</sub>, E<sub>w</sub>) atomically compares the values of X and E<sub>r</sub>; if equal then X is written to the value of E<sub>w</sub> and set r. If the value of X is not equal to the value of E<sub>r</sub> then the RMW fails. Command RMW(X, E<sub>r</sub>) atomically updates the value of X with the value of E<sub>r</sub> and returns the value of X to r. A failed RMW performs only read access. A fence orders certain memory accesses. We use conditional and unconditional branches for the program's control flow. Finally, a program consists of a set of initialization writes followed by a parallel composition of thread commands. Unless otherwise mentioned, the initializations set all memory locations to zero.
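The two RMW forms can be pictured with the following single-threaded Python sketch. It does not model atomicity or any memory model, the function names are hypothetical, and returning the previously stored value to the caller is an assumption about what the register receives.

```python
def rmw_cas(mem, x, expected, new):
    """Sketch of RMW(X, Er, Ew): write `new` to x only if x holds `expected`.

    Returns (old_value, succeeded); a failed RMW only reads.
    """
    old = mem[x]
    if old == expected:
        mem[x] = new
        return old, True
    return old, False

def rmw_exchange(mem, x, new):
    """Sketch of RMW(X, Er): store `new` unconditionally, return the previous value."""
    old = mem[x]
    mem[x] = new
    return old

mem = {"X": 0}
first = rmw_cas(mem, "X", 0, 5)     # succeeds: X goes from 0 to 5
second = rmw_cas(mem, "X", 0, 7)    # fails: X is 5, so only a read happens
previous = rmw_exchange(mem, "X", 9)
```

In terms of the events introduced later, the successful call corresponds to a read followed by a write on X, and the failed call to a single read.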

#### *A. Program Semantics and Execution Graphs*

We follow the axiomatic models for all architectures [18, 19, 20, 21, 22, 23, 24, 25, 26]. In these axiomatic models a program's semantics is defined by a set of consistent executions. An execution consists of a set of events and relations.

Event. An event ⟨id, tid, lab⟩ consists of unique identifier id, thread identifier tid ∈ N, and a label lab based on the respective executed memory or fence access. A label is of the form ⟨op, loc, val⟩ where op, loc, and val are operation type, location, and read or written value.

Preliminaries. Given a binary relation P on events, dom(P) and codom(P) are its domain and its range. P<sup>−1</sup>, P<sup>?</sup>, P<sup>+</sup>, and P<sup>∗</sup> are the inverse, reflexive, transitive, and reflexive-transitive closures of P respectively. P<sub>ℓ</sub> denotes P related event pairs on same locations i.e. P<sub>ℓ</sub> ≜ {(e, e′) ∈ P | e.loc = e′.loc} and P<sub>≠ℓ</sub> ≜ P \ P<sub>ℓ</sub> denotes the P related event pairs on different locations. imm(P) defines the immediate P relation, i.e. imm(P) ≜ {(a, b) | P(a, b) ∧ ∄c. P(a, c) ∧ P(c, b)}. P ; S is the relational composition of the binary relations P and S. Finally, [A] is an identity relation on a set A.
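These operators are straightforward to prototype over finite relations. The Python sketch below represents a relation as a set of pairs; the function names are illustrative, and `imm` treats a pair as immediate when no intermediate element relates to both ends.

```python
def inverse(p):
    """P^{-1}: swap the components of every pair."""
    return {(b, a) for (a, b) in p}

def compose(p, s):
    """Relational composition P ; S."""
    return {(a, c) for (a, b) in p for (b2, c) in s if b == b2}

def transitive_closure(p):
    """P^{+}: keep composing until no new pair appears."""
    closure = set(p)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra

def reflexive_closure(p, universe):
    """P^{?}: add the identity on the given set of events."""
    return set(p) | {(e, e) for e in universe}

def imm(p):
    """Pairs (a, b) in P with no c such that P(a, c) and P(c, b)."""
    return {(a, b) for (a, b) in p
            if not any((a, c) in p and (c, b) in p for (_, c) in p)}

po = transitive_closure({(1, 2), (2, 3)})
```

For the three-event program order above, `po` additionally contains (1, 3), while `imm(po)` recovers just the two adjacent pairs.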

R, W, and F are the set of read, write, and fence events. The events are related by primitive relations: strict partial order program-order (po) captures the syntactic order among the events, reads-from (rf) relates a write event to a read event that justifies its read value, and strict total order coherence-order (co) relates same-location writes.

Execution. An execution is of the form X = ⟨E, po,rf, co⟩ where X.E is the set of events in X. The set of po, rf, and co relations between the events in X.E are X.po, X.rf, and X.co. Execution X is *well-formed* if X.po is total in each thread and every read reads-from some write, i.e. X.R ⊆ codom(X.rf).

We derive a number of relations from these primitive relations. Relation rmw ⊆ imm(po) ∩ ([R] × [W])<sup>ℓ</sup> denotes atomic update where a read has an immediate po-successor write on the same location. The non-rmw read and write events are load (Ld) and store (St) events.

$$\mathsf{Ld} \triangleq \mathsf{R} \setminus \mathsf{dom}(\mathsf{rmw}) \qquad \qquad \mathsf{St} \triangleq \mathsf{W} \setminus \mathsf{codom}(\mathsf{rmw})$$

A successful RMW generates an rmw and a failed RMW generates a Ld event. We use a· b ≜ [{a}]; imm(po); [{b}] to denote that a and b are immediate po related events.

Relation WR denotes a write-read event pair on different locations that does not have any intermediate rmw.

$$\mathsf{WR} \triangleq \left( [\mathsf{W}]; \mathsf{po}_{\neq \ell}; [\mathsf{R}] \right) \setminus \left( \mathsf{po}; \mathsf{rmw}; \mathsf{po} \right)$$

The from-read (fr) relation relates a pair of same-location read and write events r and w where r reads-from a write w′ which is co-before w, that is, fr ≜ rf<sup>−1</sup>; co. For example, in Fig. 1a the R(X, 0) and W(X, 1) events are in fr relation.

We categorize the relations as external and internal based on whether the events are also in po relation. For the rf, co, and fr relations, rfi, coi, fri and rfe, coe, fre denote the internal and external relations respectively.


For example, the rf and fr edges in Fig. 1a are rfe and fre edges respectively. Based on rfe, coe, and fre we define the *extended-coherence-order* (eco) on same-location events: eco ≜ (rfe ∪ coe ∪ fre)<sup>+</sup>.
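To make the derived relations concrete, the following Python sketch computes fr, the external relations, and eco for the SB execution of Fig. 1a. The event names and the explicit initialization writes Wx0 and Wy0, which are placed outside every thread's po, are assumptions made for illustration.

```python
def compose(p, s):
    """Relational composition P ; S."""
    return {(a, d) for (a, b) in p for (c, d) in s if b == c}


def inverse(p):
    return {(b, a) for (a, b) in p}


def transitive_closure(p):
    closure = set(p)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def external(rel, po):
    """Keep only the pairs of rel whose events are not po-related."""
    return {(a, b) for (a, b) in rel if (a, b) not in po and (b, a) not in po}


# SB: thread 1 runs W(X,1); R(Y,0) and thread 2 runs W(Y,1); R(X,0).
po = {("Wx1", "Ry0"), ("Wy1", "Rx0")}
rf = {("Wx0", "Rx0"), ("Wy0", "Ry0")}   # both reads read the initial value
co = {("Wx0", "Wx1"), ("Wy0", "Wy1")}

fr = compose(inverse(rf), co)            # fr = rf^-1 ; co
rfe, coe, fre = (external(r, po) for r in (rf, co, fr))
eco = transitive_closure(rfe | coe | fre)

print(sorted(fr))
```

In this execution every rf, co, and fr edge crosses threads, so the external relations coincide with the full ones.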

Consistency Axioms. An axiomatic model is defined by a set of axioms. An execution is consistent in a model if it satisfies all its axioms. An axiom violation can be captured by a cycle on the respective execution graph.

#### *B. Formal Models*

Now we move to the axiomatic definitions based on various relations. Due to space constraints we elide some definitions here; they are discussed in the technical appendix [15].

In these models a store access writes value v on location x and generates an event with label W(x, v). A load access reads value v from x and generates an event with label R(x, v). A successful RMW on x reads value v′ and writes value v to generate a pair of R(x, v′) and W(x, v) events that are in rmw relation. A failed RMW generates an R(x, v′) event. The full fences in x86, ARMv8, and ARMv7 are MFENCE, DMBFULL, and DMB respectively. A full fence generates an event with label F. ARM architectures also provide an ISB fence to order a pair of reads. In ARMv7 an ISB access along with control (cmp) and jump (bc) instructions generates cmp; bc; ISB, which results in ctrlISB between a pair of read events in an execution [19]. In ARMv8 an ISB generates an ISB event.

ARMv8 Specific Accesses. In addition, ARMv8 has synchronizing memory accesses such as release write, acquire read, and acquirePC load, which are denoted by events with labels L(x, v), A(x, v), and Q(x, v). ARMv8 also provides DMBLD and DMBST fences that generate FLD and FST events. Finally, L ⊆ W, A ⊆ R, Q ⊆ Ld ⊆ R, and F, FLD, FST are the sets of release, acquire, acquirePC, and full, load, store fence events.

All these models satisfy coherence and atomicity properties. *Coherence.* The property enforces SC per location, i.e. in an execution all accesses on the same memory location are totally ordered. A complete execution graph X satisfies coherence if X.po<sup>ℓ</sup> ∪ X.rf ∪ X.co ∪ X.fr is acyclic.

*Atomicity.* An execution X violates atomicity if there is an intermediate write on the same location between rmw-related read and write events. In that case X.fre(r, w′) and X.coe(w′, w) hold where r and w are X.rmw-related events and w′ is another write on the same location as r and w.

SC. A well-formed execution X is SC when:

• (X.po ∪ X.rf ∪ X.fr ∪ X.co) is acyclic (SC)

• X.rmw ∩ (X.fre; X.coe) = ∅ (atomicity)

The executions in Fig. 1 are inconsistent in SC. For example, the SB execution has po ∪ fr cycle. Note that coherence constraint is included in (SC) axiom as po<sup>ℓ</sup> ⊆ po holds and therefore if (X.po ∪ X.rf ∪ X.fr ∪ X.co) is acyclic then (X.po<sup>ℓ</sup> ∪ X.rf ∪ X.fr ∪ X.co) is also acyclic.
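The two SC axioms can be checked mechanically on a small execution. The sketch below is a minimal illustration, assuming relations are sets of event pairs, that initialization writes are modeled as explicit events, and that acyclicity is tested by repeatedly removing source nodes. The event names and the second, SC-consistent reads-from choice are hypothetical.

```python
def compose(p, s):
    return {(a, d) for (a, b) in p for (c, d) in s if b == c}


def inverse(p):
    return {(b, a) for (a, b) in p}


def is_acyclic(edges):
    """Repeatedly peel off nodes without incoming edges; a leftover means a cycle."""
    nodes = {n for e in edges for n in e}
    remaining = set(edges)
    while nodes:
        sources = {n for n in nodes if all(b != n for (_, b) in remaining)}
        if not sources:
            return False
        nodes -= sources
        remaining = {(a, b) for (a, b) in remaining if a not in sources}
    return True


def sc_consistent(po, rf, co, rmw=frozenset()):
    fr = compose(inverse(rf), co)

    def ext(rel):
        return {(a, b) for (a, b) in rel if (a, b) not in po and (b, a) not in po}

    sc_axiom = is_acyclic(po | rf | fr | co)                    # (SC)
    atomicity = not (set(rmw) & compose(ext(fr), ext(co)))      # (atomicity)
    return sc_axiom and atomicity


# SB: thread 1 runs W(X,1); R(Y) and thread 2 runs W(Y,1); R(X).
po = {("Wx1", "Ry"), ("Wy1", "Rx")}
co = {("Wx0", "Wx1"), ("Wy0", "Wy1")}
sb_weak = {("Wx0", "Rx"), ("Wy0", "Ry")}   # both reads see the initial value
sb_sc = {("Wx1", "Rx"), ("Wy0", "Ry")}     # thread 1 runs before thread 2

print(sc_consistent(po, sb_weak, co))  # False: po ∪ fr cycle
print(sc_consistent(po, sb_sc, co))    # True
```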

Fig. 1: Distinguishing executions: the SB execution is disallowed in SC but allowed in x86 and ARM. SC and x86 disallow the LB execution but ARM models allow it. The IRIW execution is disallowed in SC, x86, and ARMv8, but allowed in ARMv7.

x86. Relation x86-preserved-program-order (xppo) orders read-read, read-write, and write-write access pairs. Relation implied signifies that an intermediate rmw or F acts as a full fence. Based on these relations x86 defines x86-happens-before (xhb). Finally, x86 defines its consistency constraints for a well-formed execution.

	- xppo ≜ ((W × W) ∪ (R × W) ∪ (R × R)) ∩ po

	- implied ≜ po; [dom(rmw) ∪ F] ∪ [codom(rmw) ∪ F]; po

x86 satisfies coherence and atomicity by the (sc-per-loc) and (atomicity) axioms respectively. Axiom (GHB) ensures a global order based on the xhb relation. The model allows Fig. 1a but disallows the executions in Figs. 1b and 1c.

ARMv8. In ARMv8 relation observed-by (obs ⊆ eco) relates same-location external events. Relation atomic-ordered-by (aob ⊆ po<sup>ℓ</sup>) orders events based on rmw and acquire or acquirePC events. The dependency-ordered-before (dob) relation captures dependency-based ordering between events, e.g. data ∪ addr ⊆ dob. Relation barrier-ordered-by (bob) orders events by fences and stronger memory accesses as follows.

$$\begin{aligned} \mathsf{bob} \triangleq{} & \mathsf{po}; [\mathsf{F}]; \mathsf{po} \cup [\mathsf{R}]; \mathsf{po}; [\mathsf{F}_{\mathsf{LD}}]; \mathsf{po} \cup [\mathsf{W}]; \mathsf{po}; [\mathsf{F}_{\mathsf{ST}}]; \mathsf{po}; [\mathsf{W}] \\ & \cup [\mathsf{L}]; \mathsf{po}; [\mathsf{A}] \cup \mathsf{po}; [\mathsf{L}] \cup [\mathsf{A} \cup \mathsf{Q}]; \mathsf{po} \cup \mathsf{po}; [\mathsf{L}]; \mathsf{coi} \end{aligned}$$

A full fence orders all accesses, a load fence orders a read with its successors, and a store fence orders a pair of writes. A release access is ordered with its predecessors and an acquire or acquirePC access is ordered with its successors. Release and acquire accesses are ordered. Finally, (a, b) is ordered if b is a write and there is an intermediate release store on the same location as b. Based on these relations ARMv8 defines the *ordered-before* (ob) order: ob ≜ (obs ∪ dob ∪ aob ∪ bob)<sup>+</sup>. A well-formed ARMv8 execution X is consistent when:



These axioms allow the executions in Figs. 1a and 1b but disallow the execution in Fig. 1c by the (external) axiom.

Fig. 2: Outcome a = 1 is allowed in ARMv7.

ARMv7. ARMv7 orders memory accesses in a thread by *preserved-program-order* (ppo) based on dependencies or the fence ⊆ po; [F]; po relation. ARMv7 also defines happens-before (ahb) and propagation (prop ⊆ R1; fence; R2) relations that can order events across threads. Finally, a well-formed ARMv7 execution X is consistent when:


Axiom (observation) constrains the set of writes from which reads may read; if a write w is in prop; ahb<sup>∗</sup> relation with a same-location read r then r does not read from a write w′ which is co-before w. Axiom (propagation) ensures that prop does not contradict co, and (no-thin-air) constrains causality cycles.

ARMv7 allows the executions in Fig. 1 including IRIW with a = c = 1, b = d = 0 outcome in the following program.

$$X[1] = 1; \;\Big\|\; \begin{array}{l} a = X[1]; \\ b = Y[a]; \end{array} \;\Big\|\; \begin{array}{l} c = Y[1]; \\ d = X[c]; \end{array} \;\Big\|\; Y[1] = 1; \qquad \text{(IRIW)}$$

In addition read-write accesses on same-location can be unordered in ARMv7. As a result, the ARM-Weak program in Fig. 2 has an execution with a = 1 outcome.

#### III. ROBUSTNESS ANALYSIS AND ENFORCEMENT

In this section we first define M-K robustness and then propose the M-K robustness conditions.

Definition 1. *A program is* M*-*K *robust if all its* K*-consistent executions are also* M*-consistent.*

Suppose a K-consistent execution X violates an axiom from M-consistency. The violation results in a cycle in X. If the cycle contains no po edge then it is formed by rfe, fre, and coe edges on same-location events, and so it also violates coherence. This is not possible as execution X is K-consistent and all K models we are considering satisfy coherence. So the cycle consists of a set of po edges along with the eco edges between them. We define these po edges as *external-program-order* (epo), i.e. epo ≜ po ∩ (codom(eco) × dom(eco)).

Thus we represent an axiom violation as an (epo; eco)<sup>+</sup> cycle where not all the epo edges on the cycle are sufficiently ordered. To enforce order we insert fences that strengthen these epo edges and rule out such cycles, which enforces M-K robustness.
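The epo definition and the resulting cycle can be illustrated on the SB execution. This is a sketch under assumptions: the event names and the explicit initialization writes are introduced here, and a cycle is detected by looking for a reflexive pair in the transitive closure of epo; eco.

```python
def compose(p, s):
    return {(a, d) for (a, b) in p for (c, d) in s if b == c}


def transitive_closure(p):
    closure = set(p)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def epo(po, eco):
    """epo = po ∩ (codom(eco) × dom(eco))."""
    dom = {a for (a, _) in eco}
    codom = {b for (_, b) in eco}
    return {(a, b) for (a, b) in po if a in codom and b in dom}


# SB execution: external rf, co, and fr edges, with init writes Wx0 and Wy0.
po = {("Wx1", "Ry0"), ("Wy1", "Rx0")}
rfe = {("Wx0", "Rx0"), ("Wy0", "Ry0")}
coe = {("Wx0", "Wx1"), ("Wy0", "Wy1")}
fre = {("Rx0", "Wx1"), ("Ry0", "Wy1")}

eco = transitive_closure(rfe | coe | fre)
sb_epo = epo(po, eco)

# (epo; eco)+ relates an event to itself exactly when an (epo; eco)+ cycle exists.
has_cycle = any(a == b for (a, b) in transitive_closure(compose(sb_epo, eco)))
print(sorted(sb_epo), has_cycle)
```

Both po edges of SB are epo edges from a write to a read on a different location, which is why fences between these pairs are what break the cycle.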

Fig. 3: Coherence ensures eco; epo<sup>ℓ</sup> ∪ epo<sup>ℓ</sup> ; eco ⊆ eco.

Theorem 1. *A program* **P** *is* M*-*K *robust if in all its* K*-consistent executions* X*,* X.epo ⊆ X.R *holds where* R *is defined as the* M*-*K *condition as follows.*


Next, we explain the M-K conditions for the concurrency models. The correctness proofs for these robustness conditions are in the technical appendix [15].

#### *A. Robustness of x86 Programs*

From the SC-x86 condition in Theorem 1, relation xppo orders read-read, read-write, and write-write pairs. So if an x86 execution violates SC-x86 robustness then it contains an (epo; eco)<sup>+</sup> cycle with one or more epo edges that are in WR relation. If an epo edge is on the same location then there is an alternative (eco; epo)<sup>+</sup> cycle, as shown in Fig. 3, that also denotes the violation. The implied; po<sup>?</sup> relation can order a write-read pair by an intermediate rmw or F.

Consider the SB execution from Fig. 1a in x86. The epo edges do not satisfy SC-x86 condition and the execution is non-SC. If we insert fences between the store-load pairs in each thread then the program exhibits only SC behaviors.

#### *B. Robustness of ARMv8 Programs*

SC-ARMv8 Robustness. Suppose an ARMv8 execution contains a (epo; eco) <sup>+</sup> cycle that violates SC-ARMv8 robustness. If an epo<sup>ℓ</sup> edge is on the cycle then as shown in Fig. 3 there is an alternative (epo; eco) <sup>+</sup> cycle without the edge.

Now consider an (epo; eco)<sup>+</sup> cycle where each epo on the cycle is in (aob ∪ bob ∪ dob)<sup>+</sup> relation. In that case the ((aob ∪ bob ∪ dob)<sup>+</sup>; eco)<sup>+</sup> cycle implies an ob cycle, which is not possible as an ARMv8-consistent execution satisfies (external). The epo edges in the SB and LB executions in Fig. 1 do not satisfy the SC-ARMv8 condition. The executions are allowed in ARMv8 but not in SC.

x86-ARMv8 Robustness. The x86-ARMv8 robustness condition orders all epo relations except WR pairs as WR is also unordered in x86. Hence an ARMv8 execution exhibits only x86 behavior if the x86-ARMv8 condition holds. Consider the SB execution from Fig. 1a in ARMv8; both the epo edges are also in WR and the execution is x86 consistent.

Fig. 4: ARMv7 allows the execution of the WP program.

#### *C. Robustness of ARMv7 Programs*

SC-ARMv7 Robustness. The ARMv7 model uses po<sup>ℓ</sup> and fence relations to order epo edges for SC-ARMv7 robustness.

The ppo and po<sup>ℓ</sup> do not guarantee SC-ARMv7 robustness as shown in the execution in Fig. 2. If we insert fences in the second and third threads the execution is disallowed in ARMv7 and the resulting program is SC-ARMv7 robust.

Moreover, ppo relations in all epo edges do not ensure SC behavior in an execution. For instance, the WP program execution in Fig. 4 is non-SC even though the epo edges are ppo-ordered. Note that, even if we insert an intermediate DMB in one of the threads the cycle is still possible in ARMv7.

x86-ARMv7 Robustness. To ensure x86-robustness, ARMv7 orders all epo relations except write-read pairs. Consider the SB program execution in Fig. 1a where the epo edges are WR pairs and the execution is consistent in both ARMv7 and x86.

ARMv8-ARMv7 Robustness. ARMv8-ARMv7 robustness requires ordering all epo<sup>≠ℓ</sup> relations except write-read and write-write pairs. In this case also the ppo relation cannot order epo<sup>≠ℓ</sup> edges. Hence the cycle in the ARMv7 execution in Fig. 4 is disallowed in ARMv8 as it is an ob cycle.

#### IV. CHECKING AND ENFORCING ROBUSTNESS

In this section we lift the semantic notion of M-K robustness to the program syntax and propose static analyses to check and enforce robustness in the following steps.


#### *A. MPG Construction*

Let {f1, f2, . . . , fn} be the set of thread functions in a program that may run in parallel. Let C = ⟨V, E⟩ be a control flow graph (CFG) of a thread function where C.V are the instruction nodes and C.E are the set of control flow edges. We analyze the thread functions' CFGs to construct an MPG.

Fig. 5: Subgraph of SB2 MPG with potential epo and eco edges. SB2(true) || SB2(false) violates SC-x86 robustness.

Helper Definitions. We define the following helper conditions.


Definition 2. *An MPG is of the form* **G** = ⟨**V**, **E**⟩ *where* **G**.**V** *is the set of shared memory access pairs and* **G**.**E** *denotes the set of edges between the nodes. An edge from* (a, b) ∈ **G**.**V** *to* (c, d) ∈ **G**.**V** *implies that* b *and* c *may access the same location.*

Procedure BuildG in Fig. 6 constructs an MPG. In BuildG, lines 2-4 append the memory access pairs from CFG(f1), CFG(f2), . . . , CFG(fn) to **V**. Lines 5-8 compute the **G**.**E** edges. An edge between (a, b) and (c, d) denotes that mayAA(b, c) holds. Note that we also create **G**.**E** edges between access pairs from the same thread function. This is because multiple concurrent threads may execute the same thread function, and access pairs from a function may result in events which are concurrent in an execution. In this case we effectively analyze all programs of the form f<sub>1</sub> || · · · || f<sub>1</sub> || · · · || f<sub>n</sub> || · · · || f<sub>n</sub>.
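The construction can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Fency implementation: MM(C) is approximated by all ordered access pairs (a, b) with a CFG path from a to b, a CFG is a pair of node and edge sets, and mayAA is a hypothetical predicate that compares location names.

```python
from itertools import product


def reachable_pairs(nodes, edges):
    """Assumed stand-in for MM(C): pairs (a, b) with a CFG path from a to b."""
    succ = {n: set() for n in nodes}
    for a, b in edges:
        succ[a].add(b)
    pairs = set()
    for start in nodes:
        seen, stack = set(), list(succ[start])
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(succ[n])
        pairs |= {(start, n) for n in seen}
    return pairs


def build_g(cfgs, may_aa):
    """Nodes are access pairs; an edge links (a, b) to (c, d) when mayAA(b, c)."""
    V = set()
    for nodes, edges in cfgs:
        V |= reachable_pairs(nodes, edges)
    E = {(p, q) for p, q in product(V, V) if may_aa(p[1], q[0])}
    return V, E


# Two straight-line thread functions; each access is (id, kind, location).
w_x, r_y = (1, "W", "X"), (2, "R", "Y")
w_y, r_x = (3, "W", "Y"), (4, "R", "X")
cfgs = [({w_x, r_y}, {(w_x, r_y)}), ({w_y, r_x}, {(w_y, r_x)})]


def same_loc(a, b):
    """Hypothetical mayAA: two accesses may alias when they name the same location."""
    return a[2] == b[2]


V, E = build_g(cfgs, same_loc)
print(len(V), len(E))
```

In this store-buffering shaped example the two access pairs point at each other, which is the kind of MPG cycle the subsequent check looks for.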

#### *B. Checking robustness on MPG*

A cycle in MPG **G** implies a potential (epo; eco) <sup>+</sup> cycle in an execution. Cy(**G**) returns the set of access pairs that may create cycle(s) in the MPG **G** i.e.

$$\begin{aligned} \mathsf{Cy}(\mathbb{G}) & \stackrel{\Delta}{=} \{ n \mid n \in \mathbb{G}. \mathbb{V} \land \exists m, o \in \mathbb{G}. \mathbb{V}. \\ & m \neq n \land o \neq n \land \mathbb{G}. \mathbb{E}(m, n) \land \mathbb{G}. \mathbb{E}(n, o) \} \end{aligned}$$

We do not create any self loop in **G** on n. A self loop on n implies that n may create concurrent event pairs (p, q) and (r, s) in an execution where eco(q, r) or eco(p, s) holds, which implies (p, q), (r, s) ∈ po<sup>ℓ</sup>. However, po<sup>ℓ</sup> is included in all M-K robustness conditions and therefore multiple event pairs from n do not create any new robustness violation.

If Cy(**G**) has any unordered access pair following respective Ord condition then we report M-K robustness violation.
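Read literally, Cy(**G**) keeps every node that has an incoming edge from a different node and an outgoing edge to a different node, so it over-approximates the nodes that lie on a cycle. A small Python sketch of that reading, with a hypothetical three-node graph:

```python
def cy(V, E):
    """Nodes with an incoming edge from, and an outgoing edge to, another node."""
    has_in = {n for (m, n) in E if m != n}
    has_out = {n for (n, o) in E if o != n}
    return {n for n in V if n in has_in and n in has_out}


# Hypothetical MPG: p and q point at each other, r only has a self loop as exit.
V = {"p", "q", "r"}
E = {("p", "q"), ("q", "p"), ("q", "r"), ("r", "r")}
candidates = cy(V, E)
print(sorted(candidates))  # ['p', 'q']
```

Node r is dropped because its only outgoing edge is a self loop, which the definition ignores through the o ≠ n side condition.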

Example. Consider the SB2 function in Fig. 5. The program SB2(true) || SB2(false) violates SC-x86 robustness due to an execution where R(Y, 0) and R(X, 0) are possible in the first and second threads respectively. We construct the MPG from the accesses {1, 2, 3, 4}. The subgraph in Fig. 5 contains a cycle of (1, 3) and (2, 4) that depicts the SC-x86 robustness violation.

#### 1) Defining Ord Conditions

To define an Ord condition we use the following definitions.


$$\begin{array}{l} \mathcal{P}_{\mathsf{nf}}(\mathcal{C}, i, j, F) \triangleq \mathcal{P}(\langle \mathcal{C}.\mathcal{V} \setminus F, \; \mathcal{C}.\mathcal{E} \setminus B \rangle, i, j) \\ \text{where } B = (\mathcal{C}.\mathcal{V} \times F) \cup (F \times \mathcal{C}.\mathcal{V}) \end{array}$$


$$\begin{aligned} \mathsf{isWR}(\mathcal{C}, i, j) \triangleq{} & \mathsf{isW}(i) \land \mathsf{isR}(j) \land \neg \mathsf{mustAA}(i, j) \\ & \land \nexists u \; (u \in \mathsf{ac}(\mathcal{C}, \mathsf{rmw}) \land \mathcal{P}(\mathcal{C}, i, u) \land \mathcal{P}(\mathcal{C}, u, j)) \end{aligned}$$

x86. The Ord condition for SC-x86 robustness is as follows.

$$\begin{aligned} \mathsf{Ord}(\mathsf{SC}, \mathsf{x86}, \mathcal{C}, i, j) \triangleq{} & \mathsf{isR}(i) \lor \mathsf{isW}(j) \lor \mathsf{mustAA}(i, j) \\ & \lor \neg \mathcal{P}_{\mathsf{nf}}(\mathcal{C}, i, j, \mathsf{ac}(\mathcal{C}, \mathsf{F})) \end{aligned}$$

The isR(i) and isW(j) conditions ensure xppo relations between the events generated from i and j. mustAA(i, j) checks if the event pairs generated from i and j are in epo<sup>ℓ</sup> relation. The Pnf condition checks if there are intermediate fences between the events generated from i and j in all executions. The Ord condition is satisfied in LB and IRIW but violated in the SB program.
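A minimal Python sketch of this check is given below. It assumes that P(C, i, j) is plain CFG reachability, that a CFG is given as node and edge sets with a `kind` map labelling each node as "R", "W", or "F", and that mustAA is supplied as a predicate. All of these names are illustrative.

```python
def path_exists(edges, i, j):
    """P(C, i, j): is j reachable from i along CFG edges?"""
    seen, stack = set(), [i]
    while stack:
        n = stack.pop()
        for a, b in edges:
            if a == n and b not in seen:
                seen.add(b)
                stack.append(b)
    return j in seen


def path_no_fence(edges, i, j, fences):
    """P_nf(C, i, j, F): a path from i to j survives after deleting the nodes in F."""
    kept = {(a, b) for (a, b) in edges if a not in fences and b not in fences}
    return path_exists(kept, i, j)


def ord_sc_x86(nodes, edges, i, j, kind, must_aa):
    """Ord(SC, x86, C, i, j) as a disjunction of the four cases."""
    fences = {n for n in nodes if kind[n] == "F"}
    return (kind[i] == "R" or kind[j] == "W" or must_aa(i, j)
            or not path_no_fence(edges, i, j, fences))


kind = {"wx": "W", "ry": "R", "f": "F"}


def never(i, j):
    return False   # the two accesses are assumed to touch different locations


unfenced = ord_sc_x86({"wx", "ry"}, {("wx", "ry")}, "wx", "ry", kind, never)
fenced = ord_sc_x86({"wx", "f", "ry"}, {("wx", "f"), ("f", "ry")},
                    "wx", "ry", kind, never)
# A fence on only one branch leaves a fence-free path, so the pair stays unordered.
one_branch = ord_sc_x86({"wx", "f", "ry"},
                        {("wx", "f"), ("f", "ry"), ("wx", "ry")},
                        "wx", "ry", kind, never)
print(unfenced, fenced, one_branch)  # False True False
```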

In x86 a successful RMW results in an rmw, which acts as an intermediate fence. But a failed RMW generates only a read event, which does not act as a fence. Therefore an RMW operation between a pair of memory accesses does not ensure that the access pair is ordered in all executions. However, if an RMW is used in a *wait*-loop where the loop terminates only when the RMW is successful, then the RMW in the *wait*-loop acts as a fence in all terminating x86 executions. For these programs we strengthen the SC-x86 robustness checking condition as follows.

$$\begin{aligned} \mathsf{SOrd}(\mathsf{SC}, \mathsf{x86}, \mathcal{C}, i, j) \triangleq{} & \mathsf{isR}(i) \lor \mathsf{isW}(j) \lor \mathsf{mustAA}(i, j) \\ & \lor \neg \mathcal{P}_{\mathsf{nf}}(\mathcal{C}, i, j, \mathsf{ac}(\mathcal{C}, \mathsf{F} \cup \mathsf{rmw})) \end{aligned}$$

ARMv8 (A8). isL(i), isA(i), and isAQ(i) check if an access i is a release, acquire, and acquire/acquirePC access respectively. isLA(i, j) holds for a release-acquire access pair (i, j). Lcoi(C, i) returns the set of release writes that access the same location as i. RA(C, i) returns the set of acquire reads that are reachable from i through some release write.

$$\begin{aligned} RA(\mathcal{C}, i) & \triangleq \{ a \mid \mathsf{isA}(a) \land \neg \mathcal{P}\_{\mathsf{nf}}(\mathcal{C}, i, a, \mathsf{ac}(\mathcal{C}, \mathsf{L})) \} \\ \mathsf{Lcoi}(\mathcal{C}, i) & \triangleq \{ w \mid \mathsf{isL}(w) \land \mathsf{mustA}(w, i) \} \end{aligned}$$

We now define the Ord condition for SC-ARMv8 robustness where B ≜ ac(C, F) ∪ RA(C, i). It results in B<sup>F</sup> = po; [F]; po ∪ po; [L]; po; [A]; po ⊆ bob, which acts as a fence on an epo. Moreover we define isRR(i, j) ≜ isR(i) ∧ isR(j), isRW(i, j) ≜ isR(i) ∧ isW(j), and isWW(i, j) ≜ isW(i) ∧ isW(j).

$$\begin{aligned}
\mathsf{Ord}(\mathsf{SC}, \mathsf{A8}, \mathcal{C}, i, j) \triangleq{} & \mathsf{mustAA}(i, j) && (1) \\
& \lor (\neg \mathcal{P}_{\mathsf{nf}}(\mathcal{C}, i, j, B)) \lor \mathsf{isLA}(i, j) \lor \mathsf{isAQ}(i) \lor \mathsf{isL}(j) && (2) \\
& \lor (\mathsf{isRR}(i, j) \land \neg \mathcal{P}_{\mathsf{nf}}(\mathcal{C}, i, j, B \cup \mathsf{ac}(\mathcal{C}, \mathsf{F}_{\mathsf{LD}}))) && (3) \\
& \lor (\mathsf{isRW}(i, j) \land \neg \mathcal{P}_{\mathsf{nf}}(\mathcal{C}, i, j, B \cup \mathsf{ac}(\mathcal{C}, \mathsf{F}_{\mathsf{LD}}) \cup \mathsf{Lcoi}(\mathcal{C}, j))) && (4) \\
& \lor (\mathsf{isWW}(i, j) \land \neg \mathcal{P}_{\mathsf{nf}}(\mathcal{C}, i, j, B \cup \mathsf{ac}(\mathcal{C}, \mathsf{F}_{\mathsf{ST}}) \cup \mathsf{Lcoi}(\mathcal{C}, j))) && (5)
\end{aligned}$$

The definition ensures that the events generated from i and j are in (1) po<sup>ℓ</sup> or in one of the following bob relations: (2) B<sup>F</sup> ∪ [L]; po; [A] ∪ [A ∪ Q]; po ∪ po; [L], (3) B<sup>F</sup> ∪ [R]; po; [FLD]; po, (4) B<sup>F</sup> ∪ [R]; po; [FLD]; po ∪ po; [L]; coi, (5) B<sup>F</sup> ∪ [W]; po; [FST]; po; [W] ∪ po; [L]; coi. The overall condition ensures SC-ARMv8 robustness. The condition is satisfied in IRIW but violated in SB and LB.

The dob and aob relations also order memory accesses. By definition aob ⊆ po<sup>ℓ</sup>, which is already captured by (1). We do not include dob in the Ord condition as a dependency can be optimized away after the robustness analysis, which may result in a non-robust program even when we report the original program to be robust.

Next, we define the x86-ARMv8 robustness condition where an (i, j) access pair is ordered or may generate a WR pair.

$$\mathsf{Ord}(\mathsf{x86}, \mathsf{A8}, \mathcal{C}, i, j) \triangleq \mathsf{Ord}(\mathsf{SC}, \mathsf{A8}, \mathcal{C}, i, j) \lor \mathsf{isWR}(\mathcal{C}, i, j)$$

SB and IRIW satisfy the condition but LB violates it.

ARMv7 (A7). We define the Ord condition to ensure the SC-ARMv7 robustness condition in all ARMv7 executions. Then we extend the Ord for SC-ARMv7 to define the Ord conditions for x86-ARMv7 and ARMv8-ARMv7 robustness.

Ord(SC, A7, C, i, j) ≜ mustAA(i, j) ∨ (¬Pnf(C, i, j, ac(C, F)))

Ord(x86, A7, C, i, j) ≜ Ord(SC, A7, C, i, j) ∨ isWR(C, i, j)

Ord(A8, A7, C, i, j) ≜ Ord(SC, A7, C, i, j) ∨ isW(i)

The memory access pairs in the LB program satisfy the ARMv8-ARMv7 condition, and those in the SB program satisfy the x86-ARMv7 and ARMv8-ARMv7 conditions.

#### 2) Robustness Analysis and Enforcement Procedure

The MKRobust procedure in Fig. 6 checks M-K robustness on an MPG **G**: (line 3) we first compute Cy(**G**). (lines 4-7) if an access pair (a, b) in Cy(**G**) is on a cycle then we check if (a, b) is ordered by the Ord condition. (line 8) returns the unordered memory access pairs O.

If O is empty then the program is M-K robust. Otherwise the Enforce procedure inserts appropriate fences to enforce robustness. Procedure getF returns a fence based on the access types of a and b in the memory model K. Procedure insertF inserts the fence f between a and b. Note that one inserted fence may order multiple access pairs. These methods are defined in Fig. 7. In case of x86 and ARM programs we insert MFENCE and DMB respectively. In ARMv8 we first insert DMBFULL followed by DMBLD and then DMBST fences.

    1: procedure BuildG({f1, . . . , fn})
    2:   for f ∈ {f1, . . . , fn} do
    3:     C ← CFG(f);
    4:     V ← V ∪ MM(C);
    5:   for (a, b) ∈ V do
    6:     for (c, d) ∈ V do
    7:       if mayAA(b, c) then
    8:         E ← E ∪ {((a, b), (c, d))};
    9:   return ⟨V, E⟩;
    10: end procedure

    1: procedure MKRobust(M, K, G)
    2:   O ← ∅;
    3:   AB ← Cy(G);
    4:   for (a, b) ∈ AB do
    5:     C ← getG(b);
    6:     if ¬Ord(M, K, C, a, b) then
    7:       O ← O ∪ {(a, b)};
    8:   return O;
    9: end procedure

    1: procedure Enforce(K, O)
    2:   H ← ∅;
    3:   for (a, b) ∈ O do
    4:     if b ∉ H then
    5:       f ← getF(K, a, b);
    6:       insertF(getG(b), a, b, f);
    7:       H ← H ∪ {b};
    8: end procedure

$$\mathbb{G} \leftarrow \mathsf{BuildG}(\{f_1, \ldots, f_n\}); \ O \leftarrow \mathsf{MKRobust}(M, K, \mathbb{G}); \ \mathsf{Enforce}(K, O);$$

4: if K == A8 then

8: end procedure

    1: procedure insertF(C, a, b, f)
    2:   V′ ← C.V ∪ {f};
    3:   E₁ ← C.E ∪ {(f, b)}
    4:   E′ ← E₁ ∪ {(e, f) | C.E⁺(e, b)} ∪ {(f, e) | C.E⁺(b, e)}
    5:   return ⟨V′, E′⟩;
    6: end procedure

Fig. 7: Procedure getF and insertF.
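The bookkeeping in Enforce, where the set H prevents a second fence before an access b that has already been handled, can be sketched as follows. This is only an illustration: `get_f` and `insert_f` are passed in as callables, the fence table covers just the full fences named in the text for x86 and ARMv7, the ARMv8 fence selection is deliberately left out, and the returned list is added here purely for inspection.

```python
def enforce(model, unordered, get_f, insert_f):
    """Insert one fence per distinct second access b, mirroring the H set."""
    handled = set()
    inserted = []
    for a, b in unordered:
        if b not in handled:
            fence = get_f(model, a, b)
            insert_f(a, b, fence)
            inserted.append((a, b, fence))
            handled.add(b)
    return inserted


# Hypothetical fence table: full fences only.
FULL_FENCE = {"x86": "MFENCE", "ARMv7": "DMB"}


def get_full_fence(model, a, b):
    return FULL_FENCE[model]


log = []
pairs = [("w1", "r3"), ("w2", "r3"), ("w1", "r4")]
result = enforce("x86", pairs, get_full_fence,
                 lambda a, b, f: log.append((f, b)))
print(result)
```

Three unordered pairs lead to two fences here, since the pairs ending in r3 share one fence.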

#### *C. Complexity of Robustness*

To analyze the complexity of the robustness algorithm we analyze the main procedures BuildG, MKRobust, and Enforce, which perform MM, Pnf, and Cy computations. Given a program with n statements, the number of shared memory accesses and control flow edges are bounded by n and n<sup>2</sup> respectively. Hence MM contains at most n<sup>2</sup> elements and a Pnf computation is bounded by traversing n<sup>2</sup> edges. So procedure BuildG constructs an MPG with at most |MM| = n<sup>2</sup> nodes and |MM|<sup>2</sup> = n<sup>4</sup> edges. Hence a Cy computation traverses at most n<sup>4</sup> edges. In procedure MKRobust, for each node in the MPG, we check (i) if it is on a cycle by computing Cy and (ii) if so, we perform the Pnf computation for the memory access pair. Hence MKRobust overall incurs n<sup>2</sup> ∗ (n<sup>4</sup> + n<sup>2</sup>) = n<sup>6</sup> + n<sup>4</sup> computation. Next, procedure Enforce takes at most n<sup>2</sup> computation for each access pair in MM and overall incurs at most n<sup>2</sup> ∗ |MM| = n<sup>4</sup> computation. Hence, the robustness checking and enforcement computation is bounded by O(n<sup>6</sup>), which is polynomial in the program size.

#### V. EXPERIMENTAL EVALUATION

Implementation. We implement the robustness analysis and enforcement techniques in *Fency* (for FENCe analYsis) as LLVM compiler passes for x86, ARMv8, and ARMv7 programs. We leverage the existing analyses in LLVM. The CFG analyses are used to define the MM, Path, P, and Pnf conditions. We define the mayAA and mustAA conditions using the memory operand type and alias analyses provided in LLVM.

We run the analyses on a MacOS machine having a 2.4GHz 8-Core Intel i9 processor with 64 GB RAM.

Benchmarks. We analyze a number of well-known concurrent algorithms and data structures [14, 27] including the global barrier (Barrier) construct, mutual exclusion algorithms (by Dekker, Peterson, and Lamport), different lock algorithms (e.g. Spinlock, Seqlock, Ticketlock), the non-blocking write protocol (NBW), read-copy-update (RCU) programs, the work-stealing queue in Cilk, and the ChaseLev dequeue. These programs use C11 [28, 29] atomic accesses extensively. The release-acquire (RA)/TSO/SC versions indicate the memory model for which the respective version is developed. The number of lines in the LLVM IR (.ll) files varies between 100 and 400, which indicates the approximate size of an analyzed CFG.

Naive fence insertion scheme. We compare *Fency* to a naive scheme which does not use robustness information in fence insertion. The naive scheme works as follows.

	- (x86) Insert MFENCE after load, store, and RMW accesses.
	- (ARMv8) Insert DMBLD after non-acquire loads and DMBFULL for other memory accesses.
	- (ARMv7) Insert DMB after all memory accesses.
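The naive scheme is simple enough to state as a few lines of Python. The sketch below works on a flat list of access kinds and is only an illustration: the kind names are hypothetical, and the ARMv8 rule is read literally, so every access that is not a non-acquire load is followed by a DMBFULL.

```python
def naive_fences(accesses, arch):
    """Return the access sequence with the naive scheme's fences interleaved."""
    fenced = []
    for kind in accesses:
        fenced.append(kind)
        if arch == "x86":
            fenced.append("MFENCE")           # after load, store, and RMW
        elif arch == "ARMv7":
            fenced.append("DMB")              # after every memory access
        elif arch == "ARMv8":
            # DMBLD after non-acquire loads, DMBFULL for everything else
            fenced.append("DMBLD" if kind == "load" else "DMBFULL")
        else:
            raise ValueError(f"unknown architecture: {arch}")
    return fenced


thread = ["store", "load", "acquire_load"]
for arch in ("x86", "ARMv8", "ARMv7"):
    print(arch, naive_fences(thread, arch))
```

Because the scheme ignores which access pairs can actually take part in a robustness violation, the number of fences grows with the number of accesses rather than with the number of unordered pairs.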

#### *A. Experimental Results*

In Figs. 8 and 9 we report the results of some benchmarks. The full results are in the supplementary material [15]. For comparison we also provide the number of fences required by the naive schemes as well as the results from the state-of-the-art x86-robustness checker Trencher [8].

Fig. 8: Robustness analyses and enforcement for x86 and ARMv7 programs.

Interpreting the Results. The (SC-K) entries in the tables are of the form (a|b(✓/✗) c ⟨ d) where


In ARMv8 we show the total number of DMB(FULL/LD/ST) fences. We use #(a-(b+c)) fewer fences than the naive schemes, e.g. from Fig. 8 the Barrier program requires 6-(0+2)=4 fewer fences than the naive scheme to enforce SC-x86 robustness.

For Trencher we analyze the encoded programs taken from [14]. We report whether the program is SC-x86 robust (✓/✗), the number of inserted fences (i.e. 'c'), and the execution time (i.e. 'd'). Trencher fence insertion does not terminate for RCU-offline.

#### 1) Checking Robustness

x86 programs. We report the SC-x86 robustness analysis results of *Fency* in Fig. 8 (and in [15]) and compare them with the results from Trencher on the corresponding programs.

The SC-x86 robustness analysis in *Fency* is quite precise and agrees with Trencher in all cases except the Lamport-RA, Lamport-TSO, and Cilk-SC programs. Lamport-(RA/TSO) have unordered write-read pairs that generate WR relations and hence *Fency* reports an SC-robustness violation, though these access pairs never execute concurrently in any x86 execution. Moreover, in most cases *Fency* inserts the same number of fences as Trencher.

We note a subtle case in Cilk-SC. It has an access sequence a = RRLX(T); WRLX(T, a-1); RRLX(H). Trencher reports an SC-violation due to the WR pair. However, LLVM combines the load and store of T and creates an atomic fetch-and-sub: a = RRLX(T); WRLX(T, a-1) ⇝ a = fsub(T, 1). Hence the resulting x86 program ensures SC-robustness, which *Fency* reports correctly.

We also note the execution time of *Fency* and of Trencher. Trencher incurs significantly more time for the Seqlock, Cilk-TSO, and Cilk-SC programs and does not terminate for RCU-offline fence insertion. Trencher exhibits comparable efficiency in certain programs, e.g. Spinlock and Ticketlock. However, in these programs also, if we increase the number of threads by replicating the thread functions then Trencher takes on the order of seconds to check and enforce robustness. At the same time Trencher inserts more fences. On the other hand, the analyses in *Fency* are parameterized by thread functions and therefore are unaffected by the number of executing threads.

Fig. 9: Robustness analyses & enforcement in ARMv8.

ARMv8 programs. In Fig. 9 (and in [15]) we report the robustness results of the ARMv8 programs. The ARMv8 programs violate SC and x86 robustness as the programs contain independent memory accesses on different locations which are unordered in ARMv8.

As ARMv8 is weaker than x86, the programs (e.g. Barrier) which violate SC-x86 robustness also violate SC-ARMv8 robustness. Moreover, there are programs which are SC-x86 robust but violate SC-ARMv8 robustness, such as dekker-TSO. These programs violate both SC-ARMv8 and x86-ARMv8 robustness due to unordered accesses that result in [R]; po<sup>≠ℓ</sup>; [R] or [W]; po<sup>≠ℓ</sup>; [W] relations in an execution. These access pairs are ordered in x86 but not in ARMv8 and hence violate x86-ARMv8 robustness.

Robustness of ARMv7 programs. In general the ARMv7 programs violate robustness when the x86 or ARMv8 versions are not robust, as shown in Fig. 8 (and in [15]). However, C11 release/acquire/SC accesses generate full fences in ARMv7 but synchronizing accesses in ARMv8, which act as half fences. As a result, in some programs the ARMv7 version enforces stronger ordering than the ARMv8 version. Hence these ARMv7 programs are robust unlike the ARMv8 programs. For example, consider the C11 event sequences (without read/written values) from the Spinlock and Ticketlock programs and their C11 to ARMv8 and ARMv7 mappings [30].

R(X) · W<sub>SC</sub>(Y) · R(Z) ↦ R(X) · L(Y) · R(Z) (C-v8)

R(X) · W<sub>SC</sub>(Y) · R(Z) ↦ R(X) · F · W(Y) · F · R(Z) (C-v7)

The reads are unordered in ARMv8 and may violate SC-ARMv8. The ARMv7 event sequence is ordered by fences that leads to SC-ARMv7 robustness.

The Barrier (and Peterson-RA-b) program violates SC-ARMv7 due to unordered store-load pairs, but satisfies x86 and ARMv8 robustness. Some ARMv7 programs violate SC, x86, ARMv8 robustness due to unordered read-read pairs.

#### 2) Enforcing robustness.

In most of the programs enforcing robustness against a weaker model requires fewer inserted fences. However, certain ARMv8 programs (e.g. lamport-SC) need fewer fences to enforce SC-ARMv8 than x86-ARMv8. Consider the ARMv8 sequence W(X) · R(X) · R(Y) · W(Y) that may violate SC-ARMv8 and x86-ARMv8. To ensure SC-ARMv8 we insert a DMBFULL that results in the W(X) · R(X) · F · R(Y) · W(Y) sequence. To ensure x86-ARMv8 we insert a DMBLD and a DMBST to generate a W(X) · R(X) · FLD · R(Y) · FST · W(Y) sequence.

#### 3) Performance of Robustness Analyses

We have already compared the execution times of SC-x86 robustness analysis in *Fency* and Trencher. In case of ARM program versions *Fency* takes less than a second except for the ARMv7 Cilk-(TSO/SC) programs. The timings of *Fency* analyses vary among different program versions. This is because LLVM may optimize a program differently for different architectures. So the number of memory accesses (parameter 'a' in Figs. 8 and 9) and the number of memory access pairs vary. Moreover, the CFGs in different architectures also differ, which affects the Pnf and Cy computations.

#### VI. RELATED WORK

SC-robustness is studied against TSO [3, 4, 5, 6, 7, 8, 9, 10], PSO [11, 12], POWER [13], and Release-Acquire [14] models by exploring possible executions using model checking tools. On the contrary, we analyze and transform programs as LLVM passes without exploring program executions.

[8] checks and enforces SC-robustness for parameterized programs with any number of threads. It reduces the robustness checking problem to parameterized reachability analysis on possible executions. Instead, our approach is static and parameterized over the thread functions for any number of threads.

PORTHOS [31] checks portability of a program from one model to another, particularly from POWER to TSO, by encoding models in SAT/SMT solvers. In contrast, we check robustness or portability of ARM models, which are different from POWER. In addition, our analysis enables fence insertion to enforce robustness, unlike PORTHOS.

A number of approaches [32, 8, 33, 34, 35, 18, 6, 11] propose fence insertion to ensure SC. Among these fence insertion schemes, our approach is closer to the static approaches [34, 18, 35]. [18] uses delay-set analysis to ensure SC for weak memory programs. [35] proved that identifying a minimal set of fences is NP-hard and proposed minimal fence insertion based on control flow analysis. Similar to [35], we analyze the control flow graph without exploring the executions.

[32] checks SC-robustness against x86 and POWER, and restores SC by inserting lock-unlock or RMW constructs. [34] proposed fence insertion in POWER to strengthen a program to release/acquire semantics, which has the same ordering constraints between memory accesses as TSO. In contrast, we propose M-K robustness; we define robustness conditions for ARMv7 and ARMv8 programs and show that ppo is not sufficient to enforce SC in ARMv7. Moreover, we analyze parameterized programs, unlike these approaches.

We extend the abstract event graph (AEG) from [34] and propose the memory-access pair graph (MPG) in our analyses. An AEG captures the possible execution graphs statically for a given set of threads and statically detects possible robustness-violating cycles which may occur in an execution. The proposed MPG additionally accounts for the program being parameterized, where each thread function may create multiple threads, and hence constructs the event graph on all memory access pairs from all threads. Then, similar to AEG, we statically detect possible robustness-violating cycles on the MPG. However, our fence insertion may not be optimal; identifying an optimal fence insertion is a well-studied problem [35, 18, 34], which we will pursue in the context of M-K robustness.

#### VII. CONCLUSION AND FUTURE WORK

In this paper we identify robustness conditions for x86, ARMv8, and ARMv7 relaxed memory models. Based on these identified conditions we check M-K robustness. If robustness is violated we insert appropriate fences to enforce robustness. We implement our approach as LLVM compiler passes and evaluate the efficiency on a number of well-known concurrent algorithms and data structures.

Going forward we want to extend the analyses to other concurrency features in x86 and ARM models [36]. We would also like to extend these analyses to other architectures such as RISC-V [37] and Power [38].

#### REFERENCES

[1] A. Barbalace, M. L. Karaoui, W. Wang, T. Xing, P. Olivier, and B. Ravindran, "Edge computing: the case for heterogeneous-isa container migration," in *VEE'20*, 2020, pp. 73–87.


# Pruning and Slicing Neural Networks using Formal Verification

Ori Lahav and Guy Katz The Hebrew University of Jerusalem, Jerusalem, Israel {ori.lahav, guykatz}@cs.huji.ac.il

*Abstract*—Deep neural networks (DNNs) play an increasingly important role in various computer systems. In order to create these networks, engineers typically specify a desired topology, and then use an automated training algorithm to select the network's weights. While training algorithms have been studied extensively and are well understood, the selection of topology remains a form of art, and can often result in networks that are unnecessarily large, and consequently are incompatible with end devices that have limited memory, battery or computational power. Here, we propose to address this challenge by harnessing recent advances in DNN verification. We present a framework and a methodology for discovering redundancies in DNNs, i.e., for finding neurons that are not needed, and can be removed in order to reduce the size of the DNN. By using sound verification techniques, we can formally guarantee that our simplified network is equivalent to the original, either completely, or up to a prescribed tolerance. Further, we show how to combine our technique with *slicing*, which results in a *family* of very small DNNs, which are together equivalent to the original. Our approach can produce DNNs that are significantly smaller than the original, rendering them suitable for deployment on additional kinds of systems, and even more amenable to subsequent formal verification. We provide a proof-of-concept implementation of our approach, and use it to evaluate our techniques on several real-world DNNs.

#### I. INTRODUCTION

The widespread adoption of *deep learning* [17] has caused a significant leap forward in many domains within computer science. *Deep neural networks* (*DNNs*) have now become the state-of-the-art solution for a myriad of real-world problems, such as game playing [40], image recognition [41], and autonomous vehicles [5], [25]. This trend is likely to continue and intensify, thus creating an urgent need for tools and techniques to analyze and manipulate DNNs.

A part of the appeal of DNNs is that they are produced in a mostly automated way. In order to create a DNN for a particular task at hand, engineers first specify the network architecture: specifically, the number of layers in the network, the size and type of each layer, and the inter-layer connections. Then, they invoke an automated training algorithm for assigning weights to the network's edges [17]. While the automated training process has been extensively studied and is generally well understood [17], the choice of network architecture is still performed according to various rules of thumb, and is considered a form of art. This can often lead to a choice of architecture that is wasteful, i.e., which results in a large DNN, whereas a smaller DNN could have achieved similar accuracy [15], [19], [23]. For DNNs intended to run on devices with limited resources (e.g., mobile phones, or embedded circuits), excessive DNN size can be a limiting factor [25].

One successful approach for mitigating this difficulty is to first train a large network, and then shrink it by removing *redundant neurons*. Informally, we say that a neuron is redundant if removing it does not change the DNN's output; and thus, removing it from a network N results in a smaller network, N′, that is *equivalent* to N. In order to identify redundant neurons within a DNN, prior work has focused primarily on *heuristic pruning*: heuristically identifying neurons and edges that contribute little to the network's output, removing these neurons, and then performing additional training of the network [19], [23]. These methods have been highly successful in reducing DNN sizes, but they provide no formal guarantees; i.e., the removed neurons are not guaranteed to have been redundant, and the simplified network can thus be dramatically different from the original, producing different results for various inputs [35].

Recently, there has been a surge of interest in the formal verification of neural networks (e.g., [2], [14], [20], [26], [28], [32], [46], and many others). These new capabilities have made it possible to identify and remove redundancies in a network, in a way that *guarantees* that the smaller network is completely equivalent to the original [15]. Specifically, Gokulanathan et al. showed how verification could be used to identify and remove "dead" neurons, i.e., neurons whose output is 0 regardless of the network's inputs. This approach was shown to reduce network sizes by up to 10%, which is quite significant, while preserving complete equivalence to the original network.

Here, we propose a new technique, which also attempts to apply formal verification in order to remove neurons from a DNN, but which is significantly stronger. Specifically, our technique: (i) can identify additional kinds of redundant neurons (beyond "dead" neurons), whose removal does not affect the network's outputs at all; and (ii) can identify additional redundant neurons, whose removal *does* affect the network's outputs, but only up to a small, provable bound.

Finally, we propose a method that takes our approach to the extreme, by integrating it with *network slicing*. This method, in which a network is simplified into a family of much smaller sub-networks, is appropriate for cases where fast inference is crucial: an input is checked to identify the appropriate sub-network for handling it, and then only that network needs to be evaluated for that specific input. Slicing is achieved by partitioning the DNN's input domain into small sub-domains, maintaining a separate DNN for each input sub-domain, and then applying the aforementioned simplification techniques on each of these DNNs. We demonstrate that the use of small input sub-domains causes many neurons to become redundant, and consequently removable.

For evaluation purposes, we implemented our approach in an open-source, publicly available tool [33]. As a backend, our tool uses the Marabou DNN verification tool [29]. We note, however, that our approach is agnostic of the underlying verification engine; indeed, it could be integrated with any other tool, and will consequently benefit from any development in DNN verification technology. We evaluated our approach on a set of airborne collision avoidance networks [25], obtaining highly favorable results. Specifically, we were able to achieve a reduction of up to 71% in overall network sizes, while keeping the outputs identical (up to a prescribed tolerance) to those produced by the original DNN. This reduction in network sizes is a significant improvement over the previous state of the art [15]. Further, while prior techniques were specifically tailored to networks with only a specific activation function (i.e., rectified linear units [15]), our technique is applicable to multiple kinds of DNNs.

The rest of this paper is organized as follows. In Section II, we provide the necessary background on DNNs and their verification. Next, in Section III we present the basic building block of our approach, namely the removal of a single neuron. We then specify multiple kinds of neurons that can be removed in Section IV, and discuss the simultaneous removal of neurons in Section V. Subsequently, in Section VI we present how *input slicing* and simplification can be used to improve network evaluation time. An evaluation appears in Section VII, followed by a discussion of related work in Section VIII. We then conclude in Section IX.

#### II. BACKGROUND: DNNS AND THEIR VERIFICATION

A deep neural network [17] is a directed, acyclic graph, whose nodes (also referred to as *neurons*) are grouped into layers. The first layer is the *input layer*; the final layer is the *output layer*; and the intermediate layers are the *hidden layers*. When the network is evaluated, the input neurons are assigned some values (e.g., sensor readings), and these values are then propagated through the network, layer by layer, until the output values are computed. In *regression* networks, the numeric value of the output is of interest, while in the case of *classification* networks, the output neurons correspond to possible *labels* that the network can classify the input into; and the label whose neuron obtained the highest score is the one returned by the network.

Each layer in the DNN has a type, which determines how its neuron values are computed. Here, we will focus on two types: *weighted-sum* layers, and *piecewise-linear activation* layers. In a weighted-sum layer, the value of a neuron y is computed as y = b + ∑ c<sub>i</sub>v<sub>i</sub> for neurons v<sub>i</sub> from preceding layers, where the *weights* c<sub>i</sub> are determined when the network is first trained. In a piecewise-linear activation layer, the value of neuron y is computed as

$$y = \begin{cases} a\_1x + b\_1 & \text{if } s\_1 \le x < s\_2, \\ a\_2x + b\_2 & \text{if } s\_2 \le x < s\_3, \\ \dots \\ a\_kx + b\_k & \text{if } s\_k \le x \le s\_{k+1} \end{cases}$$

where x is a neuron from some preceding layer, and the a<sub>i</sub>, b<sub>i</sub> and s<sub>i</sub> parameters determine the piecewise-linear function being computed. A common example of a piecewise-linear activation function is the ReLU function, given by

$$y = \max(x, 0) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \ge 0 \end{cases}$$

(see Fig. 1). Together, weighted-sum layers and piecewise-linear activation functions make up many common DNN architectures [17]. Typically, they are used in alternation (see Fig. 2). Extending our approach to activation functions that are not piecewise-linear remains a work in progress.

Fig. 1: The ReLU function.

Fig. 2: An illustration of a DNN with alternating weighted-sum (WS) and ReLU layers.

More formally, we regard a DNN N with k inputs and m outputs as a mapping ℝ<sup>k</sup> → ℝ<sup>m</sup>. The DNN is given as a sequence of layers L<sub>1</sub>, …, L<sub>n</sub>, where L<sub>1</sub> is the input layer and L<sub>n</sub> is the output layer. We use s<sub>i</sub> to denote the size of layer L<sub>i</sub>, and use v<sub>i</sub><sup>1</sup>, …, v<sub>i</sub><sup>s<sub>i</sub></sup> to refer to the individual neurons of L<sub>i</sub>. We use V<sub>i</sub> to refer to the column vector [v<sub>i</sub><sup>1</sup>, …, v<sub>i</sub><sup>s<sub>i</sub></sup>]<sup>T</sup>. When the network is being evaluated, we assume that the input values V<sub>1</sub> are given, and that V<sub>2</sub>, …, V<sub>n</sub> are computed iteratively. The type of each hidden layer is given via the mapping T<sub>N</sub> : ℕ → T. For simplicity we set T = {weighted-sum, ReLU}, although our technique applies to all types of piecewise-linear activation functions.

In a weighted-sum layer L<sub>i</sub>, each neuron v<sub>i</sub><sup>j</sup> is associated with a linear function v<sub>i</sub><sup>j</sup> = b<sub>i</sub><sup>j</sup> + ∑ c<sub>l,t</sub> · v<sub>l</sub><sup>t</sup>; i.e., v<sub>i</sub><sup>j</sup> is computed as a weighted sum of neurons v<sub>l</sub><sup>t</sup> from preceding layers l < i, plus a bias value b<sub>i</sub><sup>j</sup>. In a ReLU layer L<sub>i</sub>, each neuron v<sub>i</sub><sup>j</sup> is associated with a specific neuron v<sub>l</sub><sup>t</sup> from a preceding layer l < i, and its value is given by v<sub>i</sub><sup>j</sup> = ReLU(v<sub>l</sub><sup>t</sup>) = max(v<sub>l</sub><sup>t</sup>, 0). Note that each neuron's value depends only on neurons from preceding layers.
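As a minimal sketch of the evaluation just described, the following Python code propagates input values through alternating weighted-sum and ReLU layers. It assumes, for brevity, that each layer reads only from the immediately preceding layer (the definition above allows any preceding layer); the layer encoding, the function names and the toy weights are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical layer encoding: ("ws", weights, biases) or ("relu",).
# weights[j][t] is the coefficient of neuron t of the previous layer
# in the weighted sum computed by neuron j of this layer.

def eval_weighted_sum(weights, biases, prev):
    """Compute b_j + sum_t c_{j,t} * prev_t for every neuron j of the layer."""
    return [b + sum(c * v for c, v in zip(row, prev))
            for row, b in zip(weights, biases)]


def eval_relu(prev):
    """Apply ReLU(x) = max(x, 0) neuron by neuron."""
    return [max(v, 0.0) for v in prev]


def evaluate(layers, inputs):
    """Propagate the input values V_1 through the layers, returning V_n."""
    values = list(inputs)
    for layer in layers:
        if layer[0] == "ws":
            values = eval_weighted_sum(layer[1], layer[2], values)
        else:
            values = eval_relu(values)
    return values


# Toy network (made-up weights): WS -> ReLU -> WS.
toy_network = [
    ("ws", [[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]),
    ("relu",),
    ("ws", [[1.0, 1.0]], [0.5]),
]
toy_output = evaluate(toy_network, [1.0, 2.0])
```

On the input [1.0, 2.0], the first layer computes [-1.0, 3.5], the ReLU layer maps this to [0.0, 3.5], and the output layer returns [4.0].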

In recent years, various security and safety issues have been discovered in DNNs [26], [43]. This has led the verification community to study the *DNN verification problem* [36]. Generally, this problem is defined by a set of constraints P on the DNN's inputs, and a set of constraints Q on the DNN's outputs; and solving it entails finding (or proving the nonexistence of) an input x such that P(x) ∧ Q(N(x)); i.e., an input x that satisfies the input condition, and is mapped by the DNN to a point that satisfies the output condition. When P and Q characterize an unsafe behavior of the DNN, an UNSAT answer to the aforementioned query indicates that the DNN is safe; whereas a SAT answer, accompanied by a satisfying assignment, demonstrates an unsafe behavior. This formalization is sufficiently expressive for capturing many properties of interest [26]. Many approaches for solving the DNN verification problem have been proposed recently (e.g., [14], [20], [26], [46], and many others). The techniques we discuss in this work use a DNN verification engine as a backend, and do not depend on the precise method used, and so we do not elaborate on this topic. We refer the interested reader to [36] for a survey.

#### III. REMOVING A SINGLE NEURON

The core of our DNN simplification approach is the identification, and then the removal, of *redundant neurons*. Given a DNN N, we seek to identify a redundant neuron v<sub>i</sub><sup>j</sup>, and then produce another network, N′, which is identical to N except for the redundant neuron that has been removed. Ideally, we would like to ensure that N and N′ are equivalent; i.e., that ∀x. N(x) = N′(x). Because N′ is obtained from N by removing a neuron, it is smaller; and this process can be repeated iteratively, to eventually obtain a significantly smaller network that is equivalent to N. Of course, the key points that need addressing are: (i) how to technically remove a redundant neuron from the network; and (ii) how to identify redundant neurons. In this section we focus on the first challenge, and describe the mechanics of removing a neuron.

In order to maintain compatibility with the original network, we will refrain from removing neurons from the network's input or output layers; all other neurons are considered candidates for removal. We distinguish between neurons in weighted-sum layers, and neurons in activation function layers. In fact, our proposed approach only supports the removal of weighted-sum neurons that feed only into other weighted-sum neurons; and the removal of activation function neurons will be performed by first transforming them into weighted-sum neurons, as described in later sections.

Consider a neuron v computed as a weighted-sum

$$v = b\_v + \sum c\_i \cdot x\_i,$$

where x<sub>i</sub> are neurons from preceding layers. Suppose that v only feeds into other weighted-sum neurons, and let u be such a neuron:

$$u = b\_u + c \cdot v + \sum d\_i \cdot y\_i,$$

where y<sub>i</sub> are again neurons from preceding layers. In this case, u's equation can be updated into:

$$u = (b\_u + c \cdot b\_v) + \sum c \cdot c\_i \cdot x\_i + \sum d\_i \cdot y\_i.$$

If this process is repeated for every (weighted-sum) neuron that v feeds into, then afterwards v will have no outgoing edges. Consequently, v could then be eliminated from the network altogether. It is straightforward to show that such an operation will never affect the value of u, and that the modified network will thus be completely equivalent to the original. Also, identifying neurons that can be eliminated is simple, and amounts to searching for weighted-sum neurons that are only connected to other weighted-sum neurons.
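The substitution above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it assumes a hypothetical representation in which every weighted-sum neuron is stored as a pair (bias, {input name: coefficient}), and it assumes that v feeds only into weighted-sum neurons. If u already reads one of v's inputs directly, the two coefficients are merged, which is equivalent to keeping the two sums of the updated equation separate.

```python
def remove_ws_neuron(net, v):
    """Fold weighted-sum neuron `v` into every weighted-sum neuron it feeds.

    `net` maps a neuron name to (bias, {input_name: coefficient}).
    Returns a new network that no longer contains `v`.
    """
    b_v, c_v = net[v]
    reduced = {}
    for name, (bias, coeffs) in net.items():
        if name == v:
            continue  # v itself is dropped
        coeffs = dict(coeffs)
        c = coeffs.pop(v, None)  # coefficient of v in this neuron, if any
        if c is not None:
            bias += c * b_v  # b_u + c * b_v
            for x, c_i in c_v.items():
                # c * c_i * x_i, merged with any existing edge from x_i
                coeffs[x] = coeffs.get(x, 0.0) + c * c_i
        reduced[name] = (bias, coeffs)
    return reduced


def evaluate(net, inputs, target):
    """Recursively evaluate neuron `target` for the given input assignment."""
    if target in inputs:
        return inputs[target]
    bias, coeffs = net[target]
    return bias + sum(c * evaluate(net, inputs, x) for x, c in coeffs.items())


# Toy example (made-up numbers): v = 1 + 2*x1 - x2 and u = 0.5 + 3*v + 4*x2.
original = {
    "v": (1.0, {"x1": 2.0, "x2": -1.0}),
    "u": (0.5, {"v": 3.0, "x2": 4.0}),
}
reduced = remove_ws_neuron(original, "v")
point = {"x1": 1.0, "x2": 2.0}
u_before = evaluate(original, point, "u")
u_after = evaluate(reduced, point, "u")
```

In this toy example u becomes 3.5 + 6·x1 + 1·x2, and both networks assign u the value 11.5 on the sample point.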

In practice, DNN topology usually alternates between weighted-sum and activation function layers, and so consecutive weighted-sum neurons are likely to be scarce. Our strategy will thus be to replace activation function neurons with weighted-sum neurons, in a way that will enable neuron removal while preserving network accuracy. As an example, let us consider a ReLU neuron, y = ReLU(x). Because of layer-type alternation, it is reasonable to assume that x is a weighted-sum neuron. In this case, if we can express y as a linear function of x, i.e., y = ax + b for some a and b, then the previous case of two consecutive weighted-sum neurons applies: we can remove x entirely, change y's type to weighted-sum, and connect y to x's inputs. Further, if y also feeds into weighted-sum neurons, then we can apply simplification once again, and remove y as well. An illustration appears in Fig. 3.

Fig. 3: Illustration: removing a neuron. x is a weighted-sum neuron which feeds into y, a ReLU neuron. After converting y into a weighted-sum neuron, both x and y can be removed.

The aforementioned steps constitute the framework of our approach, which is to repeat, until saturation, the following two steps: (i) identify any weighted-sum neurons that only feed into weighted-sum neurons, and remove them; and (ii) identify any activation function neurons that can be changed into weighted-sum neurons, without harming the network's accuracy. The key remaining issue is how to identify those neurons to which step (ii) can be applied. We elaborate on this issue in the following sections.

#### IV. LINEARIZING ACTIVATION FUNCTIONS

We next propose various criteria for determining which activation function neurons can be changed into weighted-sum neurons. Applying these criteria in practice is discussed later, in Section V.

Phase Redundancy. In order to transform an activation function neuron into a weighted-sum neuron without changing the network's outputs, we leverage the properties of piecewise-linear functions. Let x be a weighted-sum neuron and let y = f(x) be an activation function neuron; then, by definition, the value range of x is divided into segments [s<sub>1</sub>, s<sub>2</sub>], [s<sub>2</sub>, s<sub>3</sub>], …, [s<sub>k</sub>, s<sub>k+1</sub>], and in each segment y is a linear function (a weighted sum) of x. If we are able to discover that x is in fact restricted to one of these segments, i.e., s<sub>i</sub> ≤ x < s<sub>i+1</sub> for some i, then we can safely discard the constraint y = f(x) and replace it with a linear constraint y = a<sub>i</sub>x + b<sub>i</sub>, thus changing y to be a weighted-sum neuron. We stress that this change does not alter the value of y, and consequently does not alter the network's outputs. When this phenomenon occurs, we say that y is *phase-redundant*. For the ReLU function, this happens if we discover that x < 0 (y is *inactive-redundant*), or x ≥ 0 (y is *active-redundant*). As previously stated, transforming the piecewise-linear constraint into a linear one will often allow us to eliminate two neurons from the network, without changing its outputs.
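The phase-redundancy test can be illustrated by the following Python sketch, under the assumption that sound lower and upper bounds lb ≤ x ≤ ub are already known. The function `phase_of` and the encoding of the activation function as a list of breakpoints plus a list of (a<sub>i</sub>, b<sub>i</sub>) pairs are hypothetical names introduced for illustration. The segments are treated as closed, which is harmless when the activation function is continuous at its breakpoints, as ReLU is.

```python
import math


def phase_of(breakpoints, pieces, lb, ub):
    """Return (a_i, b_i) if [lb, ub] lies inside a single linear segment.

    breakpoints: [s_1, ..., s_{k+1}], increasing.
    pieces:      [(a_1, b_1), ..., (a_k, b_k)], with y = a_i*x + b_i on segment i.
    Returns None when the bounds straddle a breakpoint, i.e. when the
    bounds alone do not show the neuron to be phase-redundant.
    """
    for i, (a, b) in enumerate(pieces):
        if breakpoints[i] <= lb and ub <= breakpoints[i + 1]:
            return a, b
    return None


# ReLU as a two-segment piecewise-linear function: y = 0 for x <= 0, y = x otherwise.
RELU_BREAKPOINTS = [-math.inf, 0.0, math.inf]
RELU_PIECES = [(0.0, 0.0), (1.0, 0.0)]

inactive = phase_of(RELU_BREAKPOINTS, RELU_PIECES, -3.0, -0.5)  # y = 0
active = phase_of(RELU_BREAKPOINTS, RELU_PIECES, 0.2, 4.0)      # y = x
unstable = phase_of(RELU_BREAKPOINTS, RELU_PIECES, -1.0, 2.0)   # no single phase
```

A `None` result does not mean that the neuron is not phase-redundant; it only means that the supplied bounds are too loose to decide, which is why tighter bounds and verification queries are used later in the process.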

Forward Redundancy. Phase-redundancy captures the case where an activation function neuron is fixed to a single linear phase, for all possible inputs. However, there actually exist *unstable* activation-function neurons, i.e., neurons not fixed to a particular linear phase, which can still be soundly transformed into weighted-sum neurons computing one of these linear phases. Intuitively, this happens when neuron y's assignment affects its k succeeding layers, for some k > 0, but gets "canceled out" in layer k + 1. A small, illustrative example appears in Fig. 4. When replacing y with a weighted-sum neuron only affects neurons that are at most k layers away from y, we say that y is k*-forward-redundant*. Much like phase-redundant neurons, k-forward-redundant neurons can be removed from the network without harming its accuracy.

Fig. 4: The orange ReLU neuron, marked y, is *2-forward-redundant*. Replacing y with a constant zero affects the following WS and ReLU layers, but it does not affect the last WS layer (and thus the network output). For example, observe that if we input 1 into the network, y evaluates to 1, and the network's output evaluates to 12. This output value is unchanged even if we replace y's value with 0. A careful examination of the network reveals that this will always be the case, regardless of the network's input value.

More formally, let v<sub>i</sub><sup>j</sup> be an activation function neuron, and let N′ be a network obtained from N by replacing v<sub>i</sub><sup>j</sup> with a weighted-sum neuron v<sub>i</sub><sup>j</sup> = b<sub>i</sub><sup>j</sup> + ∑ c<sub>k</sub>x<sub>k</sub>. Let V<sub>1</sub> denote an input vector, on which both N and N′ are evaluated; and let V<sub>2</sub>, …, V<sub>n</sub> and V′<sub>2</sub>, …, V′<sub>n</sub> denote the layer evaluations of N and N′ (respectively) on V<sub>1</sub>. If, for every V<sub>1</sub>, it holds that V<sub>i+k</sub> = V′<sub>i+k</sub>, then we say that neuron v<sub>i</sub><sup>j</sup> is k-forward-redundant (note that this implies V<sub>i+k′</sub> = V′<sub>i+k′</sub> for every k′ > k). We note that a neuron that is phase-redundant is also k-forward-redundant, for any k ≥ 0.
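To make the formal definition concrete, the following Python sketch evaluates a small, made-up network (not the one in Fig. 4) with and without an unstable ReLU neuron y, and reports the smallest k for which the layer values V<sub>i+k</sub> and V′<sub>i+k</sub> agree on all sampled inputs. The network, the function names and the sampling parameters are assumptions chosen for illustration. Random sampling can only refute forward redundancy; establishing it for every input requires a verification query.

```python
import random


def relu(value):
    return max(value, 0.0)


def layers_from_y(x, replace_y_with_zero):
    """Return [V_i, V_{i+1}, V_{i+2}, V_{i+3}], where y = ReLU(x) sits in layer i.

    Toy network: WS layer (y + 5, -y + 5), then ReLU, then WS output (sum).
    For x in [-1, 1] both WS neurons stay positive, so the output is always 10.
    """
    y = 0.0 if replace_y_with_zero else relu(x)
    ws = [y + 5.0, -y + 5.0]
    act = [relu(value) for value in ws]
    out = [act[0] + act[1]]
    return [[y], ws, act, out]


def smallest_agreeing_offset(samples=10_000, seed=0, tol=1e-9):
    """Smallest k such that layers i+k and beyond agree on every sampled input."""
    rng = random.Random(seed)
    agree = [True] * 4
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        original = layers_from_y(x, replace_y_with_zero=False)
        modified = layers_from_y(x, replace_y_with_zero=True)
        for k in range(4):
            if any(abs(a - b) > tol for a, b in zip(original[k], modified[k])):
                agree[k] = False
    for k in range(4):
        if all(agree[k:]):
            return k
    return None


offset = smallest_agreeing_offset()
```

In this toy network the contribution of y cancels in the final weighted sum, so the two evaluations differ in the layer of y and in the two layers after it, and coincide from the third layer after y onward.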

Relaxed Redundancy. So far, we discussed replacing a piecewise-linear activation neuron with a weighted-sum neuron that corresponds to one of the activation function's linear segments; e.g., in the case of y = ReLU(x), neuron y would be changed into a weighted-sum neuron computing either y = 0 or y = x. We observe that, although these linear functions are natural candidates for replacing the original constraint, in fact any linear function y = ℓ(x) could be used. Specifically, given an activation function y = f(x) and some known lower and upper bounds lb and ub for x (computed, e.g., using interval arithmetic [26] or abstract interpretation [14], [46]), we propose to find a linear function ℓ(x) that has *minimal error* compared to f(x). We define this error to be

$$\max\_{lb \le x \le ub} |f(x) - \ell(x)|$$

See Fig. 5 for an illustration of replacing a ReLU constraint, whose phase is not fixed, with three linear constraints. In each illustration, the blue line is the ReLU, the dashed line is the linear replacement, and the red area is the introduced error. In case (c), the maximal introduced error (the height of the red region) is the smallest among the three options.

Fig. 5: Replacing a ReLU with linear functions: (a) the zero function; (b) the identity function; (c) an arbitrary linear function.

Unlike in the phase-redundancy and k-forward-redundancy cases, setting y = ℓ(x) will introduce some imprecision to the network's output. The motivation is that by replacing y = f(x) with y = ℓ(x) that has minimal error, we would be introducing only a small imprecision, while enabling the removal of y. Let e<sub>t</sub> be some user-defined error threshold; when replacing y = f(x) with ℓ(x) introduces an error e such that e ≤ e<sub>t</sub>, we say that neuron y is *relaxed-redundant*.

Let us focus on the y = ReLU(x) function as an example, and suppose we know that x ∈ [lb, ub]. If lb < 0 and ub > 0, the neuron is not phase-redundant. In this case, a linear function y = l<sub>m</sub>(x) with minimal error can be easily computed, and is given by:

$$l\_m(x) = \frac{ub}{ub - lb} \cdot x + \frac{-lb \cdot ub}{2(ub - lb)}.$$

It is straightforward to check that the maximum error is obtained when x = 0, and it is given by −lb·ub / (2(ub − lb)) (a proof appears in Appendix 1 in the extended version of this paper [34]). Unsurprisingly, when lb or ub are close to 0, the error becomes very small, indicating that such ReLUs, which are "almost phase-redundant", could be removed at a small cost to precision. It should be noted, however, that minimizing the maximum error introduced by the removal of a single neuron does not necessarily minimize the overall imprecision introduced to the network's outputs.
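The closed-form expression for l<sub>m</sub> can be checked numerically. The Python sketch below computes the slope and intercept from the formula above and measures the largest deviation from ReLU on a uniform grid over [lb, ub]; the function names and the grid resolution are illustrative assumptions, and a grid check is of course not a substitute for the proof.

```python
def minimal_error_line(lb, ub):
    """Slope and intercept of l_m for y = ReLU(x) with x in [lb, ub], lb < 0 < ub."""
    assert lb < 0.0 < ub, "the formula applies only when the phase is not fixed"
    slope = ub / (ub - lb)
    intercept = -lb * ub / (2.0 * (ub - lb))
    return slope, intercept


def max_error_on_grid(lb, ub, steps=10_000):
    """Largest |ReLU(x) - l_m(x)| over a uniform grid on [lb, ub]."""
    slope, intercept = minimal_error_line(lb, ub)
    worst = 0.0
    for i in range(steps + 1):
        x = lb + (ub - lb) * i / steps
        error = abs(max(x, 0.0) - (slope * x + intercept))
        worst = max(worst, error)
    return worst


example_line = minimal_error_line(-1.0, 3.0)
example_error = max_error_on_grid(-1.0, 3.0)
predicted_error = -(-1.0) * 3.0 / (2.0 * (3.0 - (-1.0)))
```

For lb = −1 and ub = 3 this gives l<sub>m</sub>(x) = 0.75·x + 0.375 with a maximal error of 0.375, compared with a maximal error of 3 when replacing the ReLU with the zero function and 1 when replacing it with the identity.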

Result-Preserving Redundancy. In classification networks, it may be acceptable to give up some precision, as long as the output label for each input is unchanged; i.e., if the original network classified input x as label l with 80% confidence, it may be acceptable to remove neurons in a way that reduces this confidence to 60%, as long as x is still classified as l.

More formally, let y = f(x) be an activation neuron in a network N, and let N′ denote the same network with y replaced by a weighted-sum neuron, y = ℓ(x). If, for every input vector V<sub>1</sub>, it holds that argmax(V<sub>n</sub>) = argmax(V′<sub>n</sub>), i.e., if both networks classify each input vector in the same way (regardless of the actual output neuron values computed), then we say that neuron y is *result-preserving redundant*. See Fig. 6 for an example.

Fig. 6: The orange ReLU, marked y, is result-preserving redundant and can be replaced with a constant zero. Observe that any input in range (0.1, 1] is classified as label #1, while any input in range [−1, 0.1) is classified as label #2. The ReLU in orange is active only for inputs in (0.2, 1], and it only increases the confidence in label #1. For example, the network output for input 0.5 is [1.3, 0.3]<sup>T</sup>, and after replacing y with 0 the output becomes [1.0, 0.6]<sup>T</sup>. Label #1 still wins, but with a lower confidence. Thus, y is result-preserving redundant: replacing it with a constant zero does not change the winning class, for the entire input domain.

Note that result-preserving redundancy is, in a way, more permissive than the previous categories: we do not directly try to bound the imprecision introduced, but rather only try to maintain the same output *label* for every input. Clearly, any neuron that is phase-redundant or k-forward-redundant is also result-preserving; and it is reasonable to assume that relaxed-redundant neurons with a small error would also be result-preserving redundant. The motivation for considering this kind of redundancy is that, due to its more permissive nature, it can identify additional redundant neurons.

Our definition of result-preserving redundancy can also be slightly relaxed, to exclude inputs whose classification was *borderline*; i.e., inputs whose highest-scored label and the second-highest label received very similar scores. Intuitively, with this alteration, a neuron is considered result-preserving redundant if it does not change the classification of any inputs which were previously classified with a high degree of confidence, but may flip the classification of inputs about which the DNN was not sure to begin with. The motivation for this change is to allow the removal of additional neurons.

#### V. NEURON REMOVAL STRATEGIES

In Section IV we laid the theoretical foundations of our DNN simplification approach, by defining four kinds of redundant neurons that could be removed to reduce network size. There exist many strategies for applying these definitions in practice, in order to reduce network sizes. Intuitively, a good strategy is one that identifies large sets of neurons that can be removed simultaneously, in a way that is computationally efficient. In this section, we propose one such strategy, which we have empirically observed to perform well.

Step 1: Bound Estimation using MILP. Let v be an activation function neuron which we are considering for removal. In this context, it is useful to deduce lower and upper bounds for v that are as tight as possible. Such bounds could lead, for example, to the classification of v as phase-redundant, or enable us to compute l<sub>m</sub>(v) and declare v to be relaxed-redundant.

Mixed-Integer Linear Programming (MILP) [9] is a well-studied method for solving a system of linear constraints with real and integer variables. In the context of DNN verification, MILP can be used to derive lower and upper bounds on the values that the various neurons in the DNN can obtain [10], [44]. This is done by encoding a linear over-approximation of the neural network into the MILP solver, and then using the solver's objective function to maximize/minimize each of the individual neurons. For example, after encoding a network N, we could set the solver's objective function to maximize 1·v, where v is some neuron in N; and the optimal solution discovered would then constitute v's upper bound.
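To illustrate how neuron bounds feed into the redundancy classification, the sketch below derives bounds for one weighted-sum layer. Note that it deliberately swaps in plain interval arithmetic (mentioned in Section IV as one possible source of bounds) instead of MILP, so that the example needs no solver; the resulting bounds are sound but generally looser than those obtained from MILP queries. All names and numbers are illustrative assumptions.

```python
def weighted_sum_bounds(weights, biases, lower, upper):
    """Interval bounds for every neuron b + sum_t c_t * x_t, given lower <= x <= upper."""
    lo, hi = [], []
    for row, b in zip(weights, biases):
        # A positive coefficient attains its minimum at the input's lower bound,
        # a negative coefficient at the input's upper bound (and vice versa).
        lo.append(b + sum(c * (l if c >= 0 else u) for c, l, u in zip(row, lower, upper)))
        hi.append(b + sum(c * (u if c >= 0 else l) for c, l, u in zip(row, lower, upper)))
    return lo, hi


def relu_status(lb, ub):
    """Classify a ReLU from the bounds on its input."""
    if ub <= 0.0:
        return "inactive-redundant"
    if lb >= 0.0:
        return "active-redundant"
    return "unstable"


# Toy layer (made-up weights) over the input box [0, 1] x [0, 1].
layer_weights = [[1.0, 1.0], [-1.0, -2.0], [1.0, -1.0]]
layer_biases = [0.5, -0.1, 0.0]
lower_bounds, upper_bounds = weighted_sum_bounds(
    layer_weights, layer_biases, [0.0, 0.0], [1.0, 1.0]
)
statuses = [relu_status(lb, ub) for lb, ub in zip(lower_bounds, upper_bounds)]
```

Here the first neuron is bounded to [0.5, 2.5] and the second to roughly [−3.1, −0.1], so the ReLUs reading them are fixed to a single phase, while the third neuron is bounded to [−1, 1] and its ReLU remains a candidate for the later steps.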

As a first step in the simplification process, we propose to run such MILP queries for every neuron that is a candidate for removal. The number of resulting queries can be large (two queries per neuron, one for each bound), but the gains are significant, as the discovered bounds can often be quite tight [44]. At the end of this step, we immediately remove all phase-redundant neurons.

In practice, it is useful to run the MILP solver with a short timeout (e.g., 10 seconds) for each neuron. In case a timeout occurs, modern solvers are able to provide a sound approximation of the optimal solution [38]. In our experiments, we observed that this initial step already detects a large number of phase-redundant neurons.

Step 2: Simulations. After the MILP phase is concluded, we are left with multiple activation-function neurons whose phases are not yet fixed. It is possible that some of these neurons are also phase-redundant, but that the bounds discovered in the MILP pass were too loose to indicate this. It is also possible that they are k-forward-redundant or result-preserving redundant. At this point we wish to quickly *rule out* as many of these candidates as possible, before applying computationally expensive steps to dispatch the remaining candidates.

To do this, we follow in the footsteps of Gokulanathan et al. [15], and apply *simulations*; i.e., we evaluate the network on a large number of random inputs, and for each input record the values assigned to the network's neurons. Simulations can easily show that a neuron is not phase-redundant, by demonstrating two different inputs for which the neuron is in two different linear phases. Similarly, they can show that a neuron is not k-forward-redundant or result-preserving redundant.
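A simulation pass of this kind can be sketched as follows. The Python code samples random inputs and rules out, as a phase-redundancy candidate, every ReLU whose input was observed to be negative on one sample and positive on another. The function `surviving_candidates`, the callback that returns pre-activation values, and the toy network are hypothetical and serve only as an illustration; surviving neurons are merely not refuted, and still have to be proven redundant.

```python
import random


def surviving_candidates(pre_activations, num_candidates, input_bounds,
                         samples=1000, seed=0):
    """Indices of ReLU candidates not refuted as phase-redundant by simulation.

    pre_activations: maps an input vector to the list of ReLU input values x,
                     one per candidate neuron.
    input_bounds:    list of (low, high) pairs, one per network input.
    """
    rng = random.Random(seed)
    seen_negative, seen_positive = set(), set()
    for _ in range(samples):
        point = [rng.uniform(low, high) for low, high in input_bounds]
        for j, x in enumerate(pre_activations(point)):
            if x < 0.0:
                seen_negative.add(j)
            elif x > 0.0:
                seen_positive.add(j)
    # A neuron observed in both phases cannot be phase-redundant.
    refuted = seen_negative & seen_positive
    return [j for j in range(num_candidates) if j not in refuted]


def toy_pre_activations(point):
    """Made-up weighted sums feeding two ReLUs: one always positive, one unstable."""
    return [point[0] + point[1] + 0.5, point[0] - point[1]]


remaining = surviving_candidates(toy_pre_activations, 2, [(0.0, 1.0), (0.0, 1.0)])
```

The same loop can be extended to refute k-forward redundancy or result-preserving redundancy, by evaluating the original and the modified network on each sample and comparing the relevant layer values or the winning labels.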

Step 3: Formal Verification. After the MILP and simulation phases, we are left with activation-function neurons that are candidates for removal, if we can prove them redundant. We now apply formal verification to classify these remaining neurons. Specifically, for each candidate neuron v, we: (i) apply verification to check whether v is fixed to one of its linear phases, and is hence phase-redundant; and if not, (ii) if N is a classification network, apply verification to check whether v is result-preserving redundant; else, if N is a regression network, apply verification to check whether v is k-forward-redundant, for a value of k that corresponds to the output layer. Each of these conditions can be posed as a DNN verification query, as described next. As soon as a neuron is marked redundant, it is removed, and the process continues.

In order to determine whether v = f(x) is phase-redundant, we must check whether x is restricted to a certain linear segment. Let $[s_1, s_2], [s_2, s_3], \ldots, [s_k, s_{k+1}]$ be the set of possible segments. For each such segment $[s_i, s_{i+1}]$, we can encode the DNN into the verifier, and pose the query: $\exists V_1.\ (x < s_i) \lor (x > s_{i+1})$. If the answer is UNSAT, we know that x is indeed fixed into segment $[s_i, s_{i+1}]$. An illustration appears in Fig. 7.

Fig. 7: A query for determining whether ReLU node v = ReLU(x) is *phase-redundant*: we check whether it is possible that x > 0, and if not, we conclude that v is inactive-redundant. To facilitate the verification process, the neurons in subsequent layers, as well as all other neurons in layer 2 (grayed out), are not encoded.

Determining whether v = f(x) is k-forward-redundant is done by creating a query where the part of the network starting from the neuron in question is duplicated. One copy of the network is the unmodified one, and in the other copy v = f(x) is replaced with a weighted-sum neuron, v′ = ℓ(x). We query the verifier whether it is possible that a neuron k layers away from v is assigned different values in the original and modified copies. If the answer is UNSAT, the neuron is k-forward-redundant. See Fig. 8 for an illustration.

Fig. 8: 4-Forward-Redundancy query illustration. The neuron in orange is the neuron being checked for forward-redundancy. In this case the layer being checked is at distance 4, which happens to be the output layer.

Determining whether v = f(x) is result-preserving redundant is done by creating a query similar to the k-forward-redundant case, only this time we ask the verifier whether there exists an input that the two networks classify differently. If the answer is UNSAT, we know that the neuron is indeed result-preserving redundant.

Step 4: Relaxed Redundancy and Accumulative Error. The aforementioned steps were aimed at identifying and removing redundant neurons, without introducing any imprecision into the simplified network. Last but not least, we discuss the removal of relaxed-redundant neurons. Recall that relaxed-redundant neurons are determined by a user-specified error threshold $e_t$. Identifying these neurons is thus a local operation that does not require verification; for every neuron we can compute the maximum error introduced by replacing it with $l_m$, and see whether it exceeds the threshold.

While each relaxed-redundant neuron can be identified locally, removing multiple neurons simultaneously runs the risk of compounding the overall error beyond the permitted threshold. To circumvent this issue and allow the efficient removal of multiple relaxed-redundant neurons, we introduce the following lemma:

Lemma 1. *Let* N *be a neural network, and let* N′ *be a simplified network, obtained from* N *by removing relaxed-redundant neurons* $u_1, \ldots, u_n$*. Consider another neuron* v *in* N′ *that is relaxed-redundant, and let* $e_{in}$ *denote the error to* v*'s* input*, previously introduced by the removal of* $u_1, \ldots, u_n$*. Let* $e_v$ *denote the error introduced by the removal of* v*. Then, if we remove* v*, the overall error introduced to its* output *is upper bounded by:*

$$e_{in} + e_v$$

This lemma tells us that the iterative removal of relaxed-redundant neurons does not compound the introduced error; instead, the error introduced by the removal of each neuron is only added to the error already introduced by the removal of other neurons. This enables us, through a straightforward computation, to upper bound the overall imprecision (on the output layer) that the removal of a set of relaxed-redundant neurons might cause. Consequently, our proposed strategy is to begin removing relaxed-redundant neurons with small error rates, each time recomputing the overall network inaccuracy, until hitting the prescribed overall error threshold. A full, formal description of these claims appears in Appendix 2 in the extended version of this paper [34].
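The greedy strategy described above can be sketched as follows. This is a simplification under a loud assumption: each neuron is tagged with its contribution to the output-layer error bound, and those contributions are simply summed. The actual propagation to the output layer is given in the extended version and is not reproduced here; the neuron names and values are hypothetical.

```python
from typing import Dict, List, Tuple


def select_relaxed_redundant(errors: Dict[str, float],
                             budget: float) -> Tuple[List[str], float]:
    """Pick neurons to remove, smallest error first, within the budget.

    Assumption: `errors` maps a neuron to its additive contribution to
    the output error bound, so the total is a plain sum.
    """
    removed: List[str] = []
    total = 0.0
    for name, err in sorted(errors.items(), key=lambda item: item[1]):
        if total + err > budget:
            break                 # the next removal would exceed the budget
        removed.append(name)
        total += err
    return removed, total


# Hypothetical per-neuron error contributions and overall threshold.
errors = {"v1": 0.125, "v2": 0.5, "v3": 0.25, "v4": 1.0}
removed, total_error = select_relaxed_redundant(errors, budget=1.0)
print(removed, total_error)
```

Because the errors add rather than compound, the running total is a valid upper bound after every removal, so the loop can stop at any point without re-verifying the network.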

#### VI. INTRODUCING REDUNDANCIES VIA INPUT SLICING

So far, our simplification efforts have hinged on the existence of redundant neurons. Next, we introduce a technique that can cause neurons to become redundant, even if they are initially not so.

The core idea is to: (i) *slice the input domain* D of the DNN N into smaller sub-domains $D_1, \ldots, D_n$; (ii) duplicate the original network n times, resulting in networks $N_1, \ldots, N_n$, such that network $N_i$ is associated with domain $D_i$; and (iii) apply the simplification process described in Section V to each $N_i$ separately. Intuitively, splitting the input domain into sub-domains can serve to separate "simpler" input regions, in which many neurons are phase-redundant, from more "complex" input regions where neurons fluctuate between phases. Various heuristics can be used for splitting the input domain, depending on the network in question. A simple splitting method, which we used in our evaluation, is to split the range of each input coordinate into n even sub-ranges.

After the slicing and simplification are done, we are left with a family of DNNs $N_1, \ldots, N_n$, which are together equivalent to the original N. Evaluation is then performed in two steps: given an input vector $V_1$, we first identify the domain $D_i$ to which $V_1$ belongs; and then compute $N_i(V_1)$ and return the result. As our evaluation shows, the resulting $N_i$ networks can be quite small, resulting in a significant improvement to the expected number of operations required for evaluating the network. This improvement might come at the expense of increased space requirements for storing the resulting family of networks, making this approach suitable for cases where space is abundant but fast inference is crucial. We note that, as a side effect, the resulting networks may be easier to verify [46], [48].
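The two-step evaluation can be sketched as a lookup followed by a call. This is an illustration only: it assumes an axis-aligned input box split evenly along each coordinate, and the per-slice "networks" are hypothetical stand-in functions rather than simplified ACAS Xu networks.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = List[Tuple[float, float]]
Network = Callable[[Sequence[float]], List[float]]


def slice_index(x: Sequence[float], input_box: Box,
                slices_per_dim: int) -> Tuple[int, ...]:
    """Step 1: find the sub-domain D_i that contains the input x."""
    index = []
    for value, (lo, hi) in zip(x, input_box):
        if not lo <= value <= hi:
            raise ValueError("input lies outside the sliced domain")
        width = (hi - lo) / slices_per_dim
        # The upper boundary belongs to the last slice.
        index.append(min(int((value - lo) / width), slices_per_dim - 1))
    return tuple(index)


def evaluate_sliced(x: Sequence[float], input_box: Box, slices_per_dim: int,
                    networks: Dict[Tuple[int, ...], Network]) -> List[float]:
    """Step 2: evaluate the simplified network N_i of that sub-domain."""
    return networks[slice_index(x, input_box, slices_per_dim)](x)


# Hypothetical family of four per-slice functions on [0, 1]^2.
box = [(0.0, 1.0), (0.0, 1.0)]
networks = {
    (0, 0): lambda x: [0.0],
    (0, 1): lambda x: [x[1]],
    (1, 0): lambda x: [x[0]],
    (1, 1): lambda x: [x[0] + x[1]],
}
result = evaluate_sliced([0.25, 0.75], box, 2, networks)

# Three rounds of bisection on each of five inputs: 8 slices per input.
acas_subdomains = (2 ** 3) ** 5
print(result, acas_subdomains)
```

The dictionary makes the space cost visible: one stored network per sub-domain, which is the trade-off against faster inference noted above.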

Discussion: Dependency on Input Dimensions. Our proposed slicing method relies on splitting the input domain, by restricting input neurons to various values. This approach works quite well on DNNs with relatively few input neurons (e.g., the ACAS Xu family of networks [25]; see Section VII for details). For networks with a larger number of input neurons (e.g., image recognition networks), the number of input sub-domains might be prohibitively large. Indeed, a similar phenomenon has been observed for verification techniques that rely on input slicing [46], [48].

One approach for mitigating this difficulty is to perform slicing not on the input layer, but on some smaller intermediate layer $L_k$ in the network. Then, the network would be evaluated by evaluating the original network's layers $L_1, \ldots, L_{k-1}$, and then using the values computed for layer $L_k$ in choosing from a set of networks for continuing the evaluation. We speculate that for an intermediate layer of a moderate size, this approach could lead to improved performance over input slicing. We leave this for future work.

Extreme Slicing: Complete Linearization. We observe that input slicing can be used to completely linearize every sub-domain of the input space; that is, if the resulting sub-domains are sufficiently small, then in each network $N_i$ all activation functions will become phase-redundant, effectively collapsing the DNN into a linear transformation. Additionally, even if the slicing does not fix the phase of all activation function neurons, extreme slicing tends to decrease the error introduced by removing relaxed-redundant neurons; and thus, complete linearization could be achieved by removing these neurons, even if they have not become phase-redundant. This linearization approach can thus be regarded as providing us with a simple, piecewise-linear approximation of the network as a whole, with an upper bound on the error in each sub-domain. Our experimental results in Section VII demonstrate very low error rates on most sub-domains.
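To see why fixed phases collapse a network into a linear (affine) map, consider a single hidden ReLU layer. The sketch below is illustrative only, with hypothetical weights: once every ReLU is known to be active (1) or inactive (0) on a sub-domain, the ReLU layer acts as a diagonal 0/1 mask and the two weighted-sum layers compose into one.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matvec(a: Matrix, v: Sequence[float]) -> List[float]:
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def collapse(w1: Matrix, b1: List[float], phases: List[int],
             w2: Matrix, b2: List[float]) -> Tuple[Matrix, List[float]]:
    """Rewrite W2 * ReLU(W1 x + b1) + b2 as A x + c for fixed phases."""
    masked_w1 = [[p * w for w in row] for p, row in zip(phases, w1)]
    masked_b1 = [p * b for p, b in zip(phases, b1)]
    a = matmul(w2, masked_w1)
    c = [s + t for s, t in zip(matvec(w2, masked_b1), b2)]
    return a, c


# Hypothetical network on the sub-domain [-1, 1]^2: the first ReLU is
# always active there and the second is always inactive.
w1, b1 = [[1.0, 1.0], [-1.0, -1.0]], [3.0, -3.0]
w2, b2 = [[2.0, 5.0]], [1.0]
a, c = collapse(w1, b1, [1, 0], w2, b2)

x = [0.5, -0.5]
hidden = [max(0.0, h) for h in [s + b for s, b in zip(matvec(w1, x), b1)]]
original = [s + b for s, b in zip(matvec(w2, hidden), b2)]
linearized = [s + t for s, t in zip(matvec(a, x), c)]
print(original, linearized)
```

The collapsed map agrees with the original network only inside the sub-domain on which the phases were established; outside it, the mask may no longer be valid.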

Complete linearization incorporates a trade-off: in order to obtain very small, nearly-linear networks, the input domain would have to be sliced many times. Users can fine-tune the number of slices used, and consequently the sizes of the resulting DNNs, to their specific needs.

### VII. EVALUATION

We created a proof-of-concept implementation of our approach as a Python framework, available online [33] (together with all benchmarks reported in this section). The framework provides all the functionality discussed so far: after importing a network, it can run MILP queries to compute neuron bounds; perform simulations; and identify phase-redundant, k-forward-redundant and result-preserving redundant neurons, by running verification queries. The framework uses the Gurobi [38] MILP solver and the Marabou [29] DNN verification engine as backends, although other backends could also be used.

For evaluation purposes, we conducted extensive experiments on the ACAS Xu system: an airborne collision avoidance system, implemented as a family of 45 neural networks [25]. Each of these neural networks has 5 input neurons, 5 output neurons, and 6 hidden layers with 50 neurons each and ReLU activation functions (310 neurons in total). Keeping the network sizes small was a key consideration in developing the ACAS Xu system [25], making it a prime candidate on which to apply simplification techniques.

We began by comparing our approach to that of Gokulanathan et al. [15], which is the current state of the art in verification-based simplification of DNNs. Their technique can be regarded as a special case of ours, in which only specific phase-redundant neurons (specifically, inactive-redundant ReLUs) are removed. We compared that approach to our framework, configured to identify and remove both active-redundant and inactive-redundant ReLUs, and also to remove relaxed-redundant neurons. We ran both tools on all 45 ACAS Xu networks; the results appear in Table I.

TABLE I: Phase-Redundancy and Relaxed-Redundancy on ACAS Xu networks.


The table depicts the accumulated numbers of redundant neurons, when read from left to right (which is the order in which the techniques were applied). First, inactive-redundant neurons are removed (this is the technique of [15]), accounting for 4% of all neurons in the network. Active-redundant neurons are next, removing another 0.2% of all neurons, which is a 3.5% increase in the number of removed neurons. Finally, relaxed-redundant neurons are removed, with three possible alternative $\epsilon$ values. The most permissive one, $\epsilon = 10^{-2}$, leads to the removal of 4.9% of the neurons in total, which is a 21.5% increase over the baseline, but the resulting network error bound in this case, 525.1, is quite high. $\epsilon = 10^{-3}$ appears to be a better choice, with a total removal rate of 4.6% and a significantly smaller error bound of 2.64. We note that our evaluation indicates that the output error bounds currently computed are far from tight; devising tighter bounding schemes is a work in progress.

In our second experiment, we evaluated our complete simplification pipeline. First, we applied input slicing, dividing the input domain into 32,768 equal sub-domains (3 rounds of bisecting the range of each of the 5 input neurons in 2). Next, for each sub-domain we: (i) ran MILP and removed any discovered phase-redundant neurons; (ii) ran simulations, and then formal verification to discover and remove any remaining phase-redundant neurons; and (iii) identified all result-preserving neurons, and greedily attempted to simultaneously remove large sets thereof, using verification. We note that identifying the largest possible set of result-preserving neurons that can be removed simultaneously is a difficult problem, and our current heuristic was a simple, greedy approach. Devising more sophisticated heuristics is left for future work.

We ran the MILP step on all 32,768 sub-domains, which resulted in the discovery of 67.3% phase-redundant neurons on average in each sub-domain. We continued to run the pipeline on a sample of 50 sub-domains selected at random. Most notably, we observed an average removal of *82.5% redundant neurons* (out of all neurons in the network), with 7.2% additional neurons still candidates for removal, but for which the underlying verification engine timed out. Of the 82.5% removed neurons, 70.2% were phase-redundant, which is a very significant increase from the 4.2% neurons removed when the pipeline was run over the entire input domain. This demonstrates the high effectiveness of input slicing. In addition, about 21% of phase-redundant neurons were active-redundant, which signifies the importance of the generalization from "dead neurons" [15] to phase-redundancy. The remaining 12.3% neurons removed were result-preserving redundant. Fig. 9 shows the breakdown.

Fig. 9: Redundant neuron removal, averaged over 10 ACAS Xu input sub-domains.

Slicing is highly beneficial for neuron removal, but results in a large number of sub-domains that need to be checked. Within our pipeline, verification steps are the most expensive, whereas MILP queries and simulations are relatively cheap. We observe, however, that MILP queries already account for most of the removed neurons. Specifically, phase-redundant neurons amounting to 68.5% of all neurons in the network were discovered through MILP (about 83% of all removed redundant neurons), with a 10-second timeout for each individual MILP query.

The next step, namely simulations, is also computationally cheap and highly effective. For each sub-domain, we ran 100,000 simulations; and out of the 31.5% of neurons which were still candidates for removal after the MILP phase, an average of 26.4% of the neurons were ruled not phase-redundant through simulations. This left only a small number of candidates to be dispatched through verification (5.1% of the neurons), which in turn discovered the remaining 1.7% redundant neurons, on average. In our experiment, each Marabou verification query was run with a 4-hour timeout.
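Reading every percentage above as a fraction of all neurons in the network (an interpretation on our part, which the reported figures are consistent with), the phase-redundancy funnel can be tallied as:

$$
\begin{aligned}
\text{candidates after MILP:} \quad & 100\% - 68.5\% = 31.5\% \\
\text{candidates after simulations:} \quad & 31.5\% - 26.4\% = 5.1\% \\
\text{phase-redundant in total:} \quad & 68.5\% + 1.7\% = 70.2\% \\
\text{all removed neurons:} \quad & 70.2\% + 12.3\% = 82.5\% \\
\text{share of removals found by MILP:} \quad & 68.5 / 82.5 \approx 83\%
\end{aligned}
$$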

As discussed above, we used a fairly naïve strategy for discovering result-preserving redundant neurons. Specifically, we ran formal verification on each candidate neuron to check whether it was individually result-preserving redundant; this resulted in a set of candidates for removal. Then, we ran result-preserving simulations, iteratively removing additional candidate neurons from the network, as long as the simulations could not find a counter-example to the redundancy of the currently removed set. Finally, we ran a single verification query to verify that removing our selected neurons was indeed a result-preserving operation. On 75% of the sub-domains checked, this strategy worked. In sub-domains where we were successful, we found an additional 24.6% forward-redundant and result-preserving redundant neurons; whereas in sub-domains where we were not successful, we had a similar number of candidates for removal on average.

In the final step of our experiment, we tested our hypothesis that slicing can lead to the complete linearization of some of the sub-domains. Indeed, for some of the sub-domains explored, the simplification pipeline was able to remove *all* neurons, resulting in a DNN that is effectively a linear transformation. We noticed, however, a high variability: for example, in another sub-domain we were only able to remove 58% of the neurons. See Fig. 10 for additional details. We conclude that there is an inherent difference between the sub-domains: apparently, some of them compute simpler transformations than others.

Fig. 10: An "almost" linear sub-domain (left) vs. a complex subdomain (right).

#### VIII. RELATED WORK

The pruning of DNNs in order to reduce their sizes has received significant attention from the machine learning community in recent years. The most common approaches are based on heuristically identifying neurons and edges that seem to contribute little to the network's output, removing these neurons and edges, and performing additional training of the network [19], [23]. Other approaches apply quantization: by using fewer bits to store the network's weights or activation functions, the DNN's footprint is decreased [21], [22], [39]. A common trait of these approaches is that, while they achieve a significant reduction in memory, they provide no guarantees about the resemblance of the smaller network to the original.

The most closely related work to our own is that of Gokulanathan et al. [15]. There, the authors use formal verification to remove dead neurons from a network, ensuring that the resulting network is equivalent to the original. Additionally, simulations are used to reduce the number of verification queries that need to be dispatched. Our work uses similar principles, but significantly extends them: we consider additional kinds of redundancy (phase-redundancy, k-forward-redundancy, and result-preserving redundancy) that produce equivalent networks, and also relaxed-redundancy, which removes additional neurons by introducing a bounded amount of imprecision.

Our work uses the Marabou DNN verification engine as a backend [1], [7], [13], [18], [27], [29], [30], [42]; but any of the many approaches and tools that have been proposed in recent years could be used as well. These approaches leverage SMT solvers (e.g., [20]), LP and MILP solvers (e.g., [6], [11], [37], [44]), the propagation of symbolic intervals and abstract interpretation (e.g., [14], [45]–[47]), abstraction-refinement techniques (e.g., [3], [12]), and many others. Recent work has extended beyond answering yes/no questions about DNNs, targeting tasks such as automated DNN repair [16], [31] and quantitative verification [4]. Verification approaches have also been proposed for recurrent networks [24], [49], which could potentially also be simplified. As DNN verification technology improves, the scalability of our approach will also increase.

#### IX. CONCLUSION AND FUTURE WORK

Neural networks often suffer from a high degree of redundancy, which affects evaluation time, memory footprint and verification costs. In this paper we presented a novel technique to identify and remove such redundancy. Our framework is customizable, allowing users to safely trade network precision for size reduction, while maintaining the introduced imprecision within a prescribed bound.

In the future, we plan to extend our work along multiple axes. Specifically, we plan to research more intelligent techniques for input domain slicing than coordinate-splitting; and also compositional techniques that would allow us to split the network into several sub-networks, identify redundancies in each of them, and then re-combine the pruned sub-networks into a single network that is smaller than the original. In addition, we plan to explore ways of combining our pruning techniques with techniques from the related field of Boolean circuit simplification [8].

Acknowledgements. We thank Ittai Rubinstein and Haoze Wu for their contributions to this project. The project was partially supported by the Israel Science Foundation (grant number 683/18) and the Binational Science Foundation (grant number 2017662).

#### REFERENCES


# Towards Scalable Verification of Deep Reinforcement Learning

Guy Amir, Michael Schapira and Guy Katz The Hebrew University of Jerusalem, Jerusalem, Israel {guyam, schapiram, guykatz}@cs.huji.ac.il

*Abstract*—Deep neural networks (DNNs) have gained significant popularity in recent years, becoming the state of the art in a variety of domains. In particular, deep reinforcement learning (DRL) has recently been employed to train DNNs that realize control policies for various types of real-world systems. In this work, we present the *whiRL 2.0* tool, which implements a new approach for verifying complex properties of interest for DRL systems. To demonstrate the benefits of *whiRL 2.0*, we apply it to case studies from the communication networks domain that have recently been used to motivate formal verification of DRL systems, and which exhibit characteristics that are conducive for scalable verification. We propose techniques for performing k-induction and semi-automated invariant inference on such systems, and leverage these techniques for proving safety and liveness properties that were previously impossible to verify due to the scalability barriers of prior approaches. Furthermore, we show how our proposed techniques provide insights into the inner workings and the generalizability of DRL systems. *whiRL 2.0* is publicly available online.

### I. INTRODUCTION

In recent years, *deep neural networks* (DNNs) [23] have become highly popular due to their ability to produce state-of-the-art results in multiple fields, e.g., image recognition [34], text classification [37], game playing [45], and many others [7]. DNNs used in such contexts have been shown to successfully learn, by training on data, a model that *generalizes* to previously unseen inputs. In particular, *deep reinforcement learning* (*DRL*) [40] has been recently used to train DNNs to learn control policies for complex computer and networked systems, surpassing the state of the art in a variety of application domains, including database management [60], compiler optimization [41], congestion control on the Internet [27], [39], routing [53], compute-resource scheduling [9], [42], adaptive video streaming [38], [43], and many more.

Despite the overwhelming success of DNNs, many safety issues pertaining to them have been identified [22], [51], demonstrating that although DNN models potentially yield excellent performance, they also suffer from many weaknesses. For instance, it has been shown that DNNs can be manipulated into performing severe errors through only slight distortions to their inputs [17]. This phenomenon, called *adversarial perturbations*, plagues effectively all modern DNNs.

Adversarial perturbations, alongside other safety and security vulnerabilities, have brought about a surge of interest in formally verifying the correctness of DNNs. A plethora of approaches for DNN verification have been proposed in recent years (e.g., [19], [25], [30], [55]). Unfortunately, in general, all proposed tools face significant scalability barriers, which render them unable to verify state-of-the-art, industrial DNNs with millions of parameters. Furthermore, even when applied to small DNNs, these tools are often restricted to verifying simplistic properties. The scalability challenge is further aggravated in the DRL context, which involves *sequential* DNN-informed decision making, and hence requires reasoning about repeated invocations of the DNN, where the outcome of one invocation can influence the input to the DNN in subsequent invocations. Consequently, the applicability of recently introduced DNN verification tools to complex properties and systems of practical interest remains extremely limited.

To begin bridging this gap, we previously introduced a tool called *whiRL 1.0* [16], which enables verifying certain safety and liveness properties, or identifying violations, for practical DRL systems. We demonstrated *whiRL 1.0*'s usefulness by verifying properties of interest for three systems from the *communication networking* domain. We identified such systems to be prime candidates for verification for two main reasons: first, state-of-the-art DNNs in this domain tend to be of moderate sizes, which are within reach of existing verification technology; and second, meaningful and complex specifications can be formulated and verified because the inputs for these systems are carefully handcrafted and reflect important semantic meaning (as opposed to raw pixel data in computer vision applications, for example). *whiRL 1.0*, which combines DNN verification techniques with bounded model checking, uses a black-box DNN verification engine as a backend, and can thus benefit from any future improvements to DNN verification technology. As exemplified by our promising initial results in [16], *whiRL 1.0* constituted a first step towards enhancing the reliability of DRL systems.

Still, *whiRL 1.0* had severe limitations: most notably, although it successfully generated violations of desired properties, it was incapable of proving that properties of practical significance held without making very strong assumptions, e.g., that runs of the considered system terminate within a very small number of steps. However, the executions of real-world systems are often infinite, or finite but consisting of many steps. In such scenarios, *whiRL 1.0* and other DRL verification tools are unable to prove that most relevant properties hold.

In this work, we present *whiRL 2.0* [1], a verification engine for DRL systems. *whiRL 2.0* significantly extends the capabilities of the original *whiRL 1.0* tool to accommodate verifying complex properties. In particular, while *whiRL 1.0* was limited to verifying basic safety properties, *whiRL 2.0* utilizes *k-induction* techniques for proving both safety and liveness properties of DRL systems. In addition, *whiRL 2.0* uses *invariant inference* techniques to quickly prove properties that could otherwise be quite difficult to verify. *whiRL 2.0* also incorporates *abstraction* methods for providing some visibility into the DRL system's operation. We demonstrate the effectiveness of these techniques by revisiting the three case studies involving state-of-the-art DRL systems to which *whiRL 1.0* has been applied in [16]: the *Aurora* [27] Internet congestion controller, the *Pensieve* [43] adaptive video streamer, and the *DeepRM* [42] compute resource scheduler. We are able to prove various properties of these systems that, to the best of our knowledge, were beyond the reach of prior state-of-the-art tools, including the original *whiRL 1.0* tool.

The rest of this paper is organized as follows. Section II covers basic background on DNNs, DRL systems, and DNN verification. Next, in Section III we present our *whiRL 2.0* verification tool, and describe its novelties and main components. We present *whiRL 2.0*'s semi-automated invariant inference in Section IV, and discuss the tool's implementation in Section V. Our case studies are described in Section VI, followed by related work in Section VII. We conclude in Section VIII.

#### II. BACKGROUND

### *A. Deep Neural Networks and Deep Reinforcement Learning*

A deep neural network (DNN) [23] is a directed graph, where the nodes (also called neurons) are organized in layers. In feed-forward DNNs, data flows from the first (*input*) layer, onto a sequence of intermediate (*hidden*) layers, and finally into a final (*output*) layer. The network is evaluated by assigning values to the input layer's neurons, and then iteratively computing the assignment of each of the hidden layers, until reaching the output layer and returning its evaluation to the user.

More specifically, the value of each neuron in the hidden and output layers is computed using the values of neurons in the preceding layer. Each such layer has a *type*, which determines the exact way in which its neuron values are computed. One common layer type is the *weighted sum* layer, in which each neuron is computed as an affine combination of the values of neurons in the preceding layer, based on edge weights and bias values determined as part of the DNN's training process. Another popular layer type is the *rectified linear unit* (*ReLU*) layer, where each node y is connected to a single node x from the preceding layer, and its value is computed by y = ReLU(x) = max(0, x). In this paper we will focus on weighted sum and ReLU layers, although there exist many additional layer types, such as *max-pooling* and *hyperbolic tangent*, to which our technique may be extended.

Fig. 1 depicts a toy DNN comprising an input layer with two neurons, followed by a weighted sum layer and a ReLU layer. For input $V_1 = [1, 3]^T$, the second layer's computed values are $V_2 = [18, -3]^T$. In the third layer, the ReLU functions are applied, resulting in $V_3 = [18, 0]^T$. Finally, the network's single output is $V_4 = [54]$.

Fig. 1: A toy DNN. The values above the edges are weights, and the values below the vertices are biases.

Formally, a DNN N that receives k inputs and returns n outputs is a mapping $\mathbb{R}^k \to \mathbb{R}^n$. The DNN consists of a sequence of m layers $L_1, \ldots, L_m$, where $L_1$ is the input layer and $L_m$ is the output layer. We use $s_i$ to denote layer $L_i$'s size, and $v_i^1, \ldots, v_i^{s_i}$ to denote $L_i$'s individual neurons. We refer to the column vector $[v_i^1, \ldots, v_i^{s_i}]^T$ as $V_i$. During evaluation, the input values $V_1$ are fed to the network's input layer, and $V_2, \ldots, V_m$ are computed iteratively.

Each weighted sum layer $L_i$ has a weight matrix $W_i$ of dimensions $s_i \times s_{i-1}$ and a bias vector $B_i$ of size $s_i$. These $W_i$ and $B_i$ are set at training time, and determine how $V_i$ is computed: $V_i = W_i \cdot V_{i-1} + B_i$. For a ReLU layer $L_i$, the values of $V_i$ are computed by applying the ReLU to each individual neuron in its preceding layer: $v_i^j = \text{ReLU}(v_{i-1}^j)$.
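The layer-by-layer evaluation can be sketched as below. Note the assumption: the weights in `TOY_LAYERS` are hypothetical, chosen only so that the input [1, 3] reproduces the values quoted in the text for the toy DNN (18, −3, then 18, 0, then 54); they are not claimed to be the weights shown in Fig. 1.

```python
from typing import List, Sequence, Tuple


def weighted_sum(weights: List[List[float]], bias: List[float],
                 values: Sequence[float]) -> List[float]:
    """V_i = W_i * V_{i-1} + B_i."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def relu(values: Sequence[float]) -> List[float]:
    """Apply ReLU(x) = max(0, x) to each neuron of the preceding layer."""
    return [max(0.0, v) for v in values]


def evaluate(layers: List[Tuple], inputs: Sequence[float]) -> List[List[float]]:
    """Return the assignments [V_1, V_2, ..., V_m] of all layers."""
    assignments = [list(inputs)]
    for layer in layers:
        previous = assignments[-1]
        if layer[0] == "weighted_sum":
            _, weights, bias = layer
            assignments.append(weighted_sum(weights, bias, previous))
        elif layer[0] == "relu":
            assignments.append(relu(previous))
        else:
            raise ValueError(f"unsupported layer type: {layer[0]}")
    return assignments


# Hypothetical weights (an assumption, see above).
TOY_LAYERS = [
    ("weighted_sum", [[3.0, 5.0], [-6.0, 1.0]], [0.0, 0.0]),
    ("relu",),
    ("weighted_sum", [[3.0, 1.0]], [0.0]),
]
trace = evaluate(TOY_LAYERS, [1.0, 3.0])
print(trace)
```

Returning the whole trace rather than only the output keeps the intermediate assignments available, which is what bound computation and simulation need.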

In *deep reinforcement learning* (*DRL*) [40], a DNN, called the *agent*, learns a *policy* π, which maps each possible observed *environment state* s to an *action* a. During training, at each discrete time-step $t \in \{0, 1, 2, \ldots\}$, a *reward* $r_t$ is presented to the agent, based on the action $a_t$ it chose to perform after observing the environment's state at that time, $s_t$. This reward is used for tuning the agent DNN's weights. The DNNs produced using DRL fall within the same general architecture described above; the difference lies in the training process, which is aimed at generating a DNN that computes a mapping π that maximizes the *expected cumulative discounted return* $R_t = \mathbb{E}\left[\sum_t \gamma^t \cdot r_t\right]$. The *discount factor*, $\gamma \in [0, 1]$, controls the effect that past decisions have on the total expected reward.
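As a small illustration of the discounted sum inside the expectation, the sketch below computes it for one finite, hypothetical reward sequence. The expectation over trajectories is not modeled here; the function name and rewards are assumptions made for the example.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma^t * r_t over a single finite trajectory."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("the discount factor must lie in [0, 1]")
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical rewards for three consecutive time-steps.
halved = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
undiscounted = discounted_return([1.0, 1.0, 1.0], gamma=1.0)
print(halved, undiscounted)
```

With a discount factor of 1 every reward counts equally, while smaller values weight early rewards more heavily.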

#### *B. Verification of Deep Neural Networks*

A DNN verification query typically includes a DNN N, a pre-condition P on N's input, and a post-condition Q on N's output [28]. The verification algorithm's goal is to find a concrete input $x_0$ such that $P(x_0) \land Q(N(x_0))$ (the SAT case), or prove that no such $x_0$ exists (the UNSAT case). Typically, we use the pre-condition P to express some states of the environment that the network might encounter, and use the post-condition Q to encode the *negation* of the behavior we would like N to exhibit in these states. Thus, when the verification algorithm returns UNSAT, this implies that the desired property always holds. Conversely, a SAT result indicates that the desired property does not always hold, and this is demonstrated by the discovered counter-example $x_0$.

For example, observe the toy DNN in Fig. 1, and suppose we wish to verify that the DNN's output is strictly larger than 5, for any input, i.e., for any $x = \langle v_1^1, v_1^2 \rangle$, it holds that $N(x) = v_4^1 > 5$. This is encoded as a verification query by choosing a pre-condition which does not restrict the input, i.e., P = (true), and by setting $Q = (v_4^1 \leq 5)$, which is the *negation* of our desired property. For this verification query, a sound verifier will return SAT, and a feasible counter-example such as $x = \langle 0, -1 \rangle$, which produces $v_4^1 = 0 \leq 5$. Hence, the property does not hold for this DNN.
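The shape of such a query can be sketched in a few lines, with two caveats stated up front. First, `toy_network` uses hypothetical weights that merely agree with the values quoted in the text (output 54 on [1, 3] and 0 on [0, −1]). Second, random search is not a verifier: it can exhibit a SAT witness, but failing to find one never establishes UNSAT.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def toy_network(x: Sequence[float]) -> float:
    """Hypothetical stand-in for the toy DNN (weights are assumptions)."""
    v2 = [3.0 * x[0] + 5.0 * x[1], -6.0 * x[0] + 1.0 * x[1]]
    v3 = [max(0.0, v) for v in v2]
    return 3.0 * v3[0] + 1.0 * v3[1]


def is_counterexample(network: Callable, pre: Callable, post: Callable,
                      x: Sequence[float]) -> bool:
    """Check P(x) and Q(N(x)) for one concrete input."""
    return pre(x) and post(network(x))


def random_search(network: Callable, pre: Callable, post: Callable,
                  box: List[Tuple[float, float]], samples: int,
                  seed: int = 0) -> Optional[List[float]]:
    """Look for a SAT witness; returning None does NOT mean UNSAT."""
    rng = random.Random(seed)
    for _ in range(samples):
        x = [rng.uniform(lo, hi) for lo, hi in box]
        if is_counterexample(network, pre, post, x):
            return x
    return None


pre = lambda x: True          # P = (true): the input is unrestricted
post = lambda out: out <= 5   # Q: negation of the desired "output > 5"

known = is_counterexample(toy_network, pre, post, [0.0, -1.0])
found = random_search(toy_network, pre, post,
                      [(-1.0, 1.0), (-1.0, 1.0)], samples=1000)
print(known, found)
```

A sound verifier replaces the sampling loop with an exhaustive search procedure, which is what allows it to conclude UNSAT.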

Verifying DRL Systems. Beyond the general challenges of verifying DNNs (most notably, scalability), verifying DRL systems involves additional challenges. These challenges stem from the fact that DRL agents typically run within reactive systems, and are invoked multiple times, with the inputs to each invocation usually affected by the outputs of previous invocations. This means that (i) the specifications for DRL systems need to account for multiple invocations; and (ii) the scalability issue is aggravated, because the verifier needs to consider multiple consecutive invocations of the network, which is akin to considering a significantly larger DNN.

While attempts have been made to develop tools tailored for DRL system verification (e.g., [16], [32], [44]), two important challenges have yet to be addressed. First, existing verification approaches for DRL systems have focused on refuting properties, and not on proving that they hold; and second, existing approaches were not geared towards verifying reactive systems. As part of the *whiRL* project, we make an initial attempt at addressing these two challenges.

### III. *whiRL 2.0*

Our contribution in this paper is the *whiRL 2.0* verification tool, which significantly extends our existing DRL verification engine, *whiRL 1.0*. The *whiRL 2.0* tool makes it possible to verify complex queries on DRL systems, which were previously beyond our reach. Specifically, it supports the verification of safety and liveness properties of DRL systems using a *k-induction*-based approach. Additionally, it incorporates *invariant inference* techniques, which facilitate the verification of complex safety properties. *whiRL 2.0* uses an underlying verification engine as a black-box, and is hence compatible with many existing DNN verifiers.

Formalizing DRL Agents. DRL agents typically operate within reactive systems: they process a (possibly infinite) sequence of states, each representing a current snapshot of the environment observed by the agent. Each state is obtained from its predecessor by triggering the action outputted by the DRL agent, and allowing the environment to react.

In line with the formulation proposed in [16], we formalize the DRL verification problem by encoding the DRL system, as well as its environment, into a transition system $\mathcal{T} = \langle S, I, T \rangle$. Each state s ∈ S in this transition system is a snapshot of the current observable environment; these states correspond to the inputs of the DNN agent. We use I ⊆ S to denote the set of initial states. The transition relation, T ⊆ S × S, is defined such that $\langle x_i, x_j \rangle \in T$ iff the system can transition from state $x_i$ to state $x_j$; i.e., when the DNN is presented with state $x_i$, it selects some action, to which the environment can respond in a way that leads the system to state $x_j$. Although the DNN is deterministic, the environment is not necessarily so, and so T need not be deterministic. An *execution* of the system is defined as a sequence of states $x_1, \ldots, x_n$, such that $x_1 \in I$, and for all 1 ≤ i ≤ n − 1 it holds that $T(x_i, x_{i+1})$. The process of encoding a DRL system as a transition system is supported by *whiRL 1.0*, via constructs for representing features common to DRL systems (e.g., inputs in the form of a "sliding window" over the recent history of observations) [16].
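The definition of an execution can be made concrete with a minimal explicit-state sketch. The `TransitionSystem` class, the `is_execution` helper, and the toy integer-valued system below are hypothetical illustrations; whiRL itself encodes states symbolically as DNN inputs rather than enumerating them.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

State = Hashable


@dataclass(frozen=True)
class TransitionSystem:
    is_initial: Callable[[State], bool]         # membership test for I
    transition: Callable[[State, State], bool]  # T(x_i, x_j)


def is_execution(system: TransitionSystem, states: Sequence[State]) -> bool:
    """Check that states[0] is initial and every consecutive pair is in T."""
    if not states or not system.is_initial(states[0]):
        return False
    return all(system.transition(a, b) for a, b in zip(states, states[1:]))


# Toy system (an assumption): the state is a counter, and the
# nondeterministic environment advances it by 1 or by 2.
toy = TransitionSystem(
    is_initial=lambda s: s == 0,
    transition=lambda a, b: b - a in (1, 2),
)
```

Because the toy relation allows two successors per state, it mirrors the point made above that T need not be deterministic even when the agent is.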

Example. As a running example, we focus on the *Aurora* DRL system [27], which implements a congestion control policy. In today's Internet, different services (e.g., video streaming like Netflix and Amazon, VoIP services such as Skype) contend over the same network bandwidth, with aggregate demand for bandwidth often exceeding the available supply. If Internet traffic sources do not pace the rates at which their data is injected into the network, the network will become congested, resulting in data being lost or delayed, and, consequently, in bad user experience and even global Internet outages. Congestion control is the task of determining, for each individual Internet traffic source, how quickly its traffic should be injected into the network at any given point in time. Congestion control is thus both a fundamental and a timely networking challenge.

Recently, researchers have proposed employing DRL for this purpose, and presented the Aurora congestion controller [27]. An Aurora-controlled traffic source uses a DNN to select the next rate at which to send traffic, based on observations regarding the implications of its past choices of sending rates. Specifically, Aurora's inputs are t vectors $v_{-t}, \ldots, v_{-1}$, containing performance-related statistics pertaining to the sender's most recent t rate-change decisions. These incorporate information about what fraction of sent data packets were lost following each rate selection, how long it took the sent packets to reach the traffic's destination, etc. The DNN's output determines whether the current rate should be increased, kept steady, or decreased. Changing the sending rate can potentially affect the environment, e.g., an increase to the rate might lead to packet loss if the new rate exceeds network capacity. These changes to the environment, in turn, affect the future inputs to the DNN. See [27] for additional details.

In the formulation of Aurora as a verification challenge in [16], each state, which corresponds to a possible input to Aurora's DNN, is represented by a t-tuple of statistics vectors. The state also contains the DNN's (deterministic) output for the input it represents. This is required for defining good and bad states, as will be discussed later. Congestion controllers are expected to converge to "good" rate decisions from any starting point. Hence, we let the set of initial states be the set of all states. Recall that the input to the DNN represents a sliding window over t-long histories of statistics vectors. Thus, for each two consecutive states, $s_1 \xrightarrow{T} s_2$, it holds that $s_2$ is obtained from $s_1$ by augmenting the vectors in $s_1$ with a statistics vector associated with the DNN's rate change at state $s_1$, and discarding the vector in $s_1$ corresponding to the least recent of the t prior rate changes.
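The sliding-window relation between consecutive states can be sketched as follows. The tuple-based state representation, the oldest-first ordering, and the helper names are assumptions made for illustration, and the DNN output stored in each state is omitted for brevity.

```python
from typing import Tuple

Stats = Tuple[float, ...]   # one statistics vector
State = Tuple[Stats, ...]   # t statistics vectors, oldest first (assumed order)


def slide(state: State, new_stats: Stats) -> State:
    """Discard the least recent vector and append the newest one."""
    return state[1:] + (new_stats,)


def is_window_successor(s1: State, s2: State) -> bool:
    """Check that s2 shares all but the oldest vector of s1, shifted by one."""
    return len(s1) == len(s2) and s1[1:] == s2[:-1]
```

Any value may appear in the newest vector, which is where the environment's nondeterminism enters the transition relation.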

DRL System Specifications. Once the DRL system is formulated as a transition system, we can specify safety and liveness properties [11] that it should uphold. *Safety properties* indicate that the system never displays unwanted behavior, and these are often formulated through a predicate $P_B(s)$ that returns true iff s ∈ S is a bad state, i.e., a state in which the property is violated. The safety verification problem then boils down to determining whether there is a reachable bad state in T [4]. *Liveness properties* indicate that the system eventually displays desirable behavior, and these are often formulated through a predicate $P_G(s)$ that returns true iff s ∈ S is a good state, i.e., a state in which the property is fulfilled. Verifying a liveness property is performed by checking that there are no infinite sequences of consecutive states in which only finitely many of the states are good [4]. For instance, a natural safety property with respect to Aurora is that when Aurora observes excellent network conditions (no packet loss, close-to-minimum packet delays), as reflected by the statistics vectors fed to the DNN, the DRL agent does not advise decreasing the sending rate in the *next time-step*. An example of a liveness property in this setting is that if excellent network conditions persist, Aurora should always *eventually* increase the sending rate.

K-Induction. Proving that safety or liveness properties hold (or finding counter-examples) involves traversing large transition system graphs. For modern DRL systems, this is often infeasible, in particular because the rich environments in which these systems operate can react in many ways after each action taken by the agent, resulting in high (or even infinite) out-degrees for many states. In *whiRL 1.0*, this issue was addressed through the application of *bounded model checking* (BMC), an approach that explores only a small fraction of the transition system graph, namely, states within a k-step distance from an initial state. BMC can find safety and liveness violations (if they are reachable within k steps) as depicted in Fig. 2, but cannot prove the absence of such violations.

Fig. 2: BMC searches for violations of a safety property. Each vector represents a state, and encodes the statistics that Aurora observed in the past t = 5 time-steps. The unwanted state is surrounded by a red rectangle, and is reachable only after k = 3 steps from the initial state. Note that consecutive states have shared inputs shifted, and each time-step sample is depicted in a different color.
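The bounded search described above can be sketched on an explicit toy graph. This is only an explicit-state analogue under stated assumptions: the `bmc` function and the counter system are hypothetical, and the actual tool encodes the k steps as a single query to a DNN verifier instead of enumerating paths.

```python
from typing import Callable, Hashable, Iterable, List, Optional

State = Hashable


def bmc(
    initial_states: Iterable[State],
    successors: Callable[[State], Iterable[State]],
    is_bad: Callable[[State], bool],
    k: int,
) -> Optional[List[State]]:
    """Return a path of at most k steps from an initial state to a bad state."""
    frontier = [[s] for s in initial_states]
    for _ in range(k + 1):  # paths with 0, 1, ..., k steps
        next_frontier = []
        for path in frontier:
            if is_bad(path[-1]):
                return path  # counter-example found (the SAT case)
            for nxt in successors(path[-1]):
                next_frontier.append(path + [nxt])
        frontier = next_frontier
    return None  # no violation within k steps; this is not a proof


def counter_successors(state: int) -> List[int]:
    """Toy system (an assumption): a counter that increases by one."""
    return [state + 1]
```

A `None` result only says that no bad state lies within k steps of an initial state, which is the limitation noted above.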

In *whiRL 2.0*, we address this important gap by adding the means for proving that safety and liveness properties hold. To this end, we employ the method of *k-induction* [11].

Intuitively, the idea in k-induction is to look for state sequences of length k, which can start from arbitrary states in T (not necessarily from initial states), and for which the property is violated. If a violating execution exists, it must contain an indicative k-long sequence of steps — a suffix of the execution that ends in the bad state for safety properties, or a sequence of non-good states for liveness properties. Thus, if a verifier finds that a k-induction query is UNSAT, we know that the corresponding property holds. If, however, it returns SAT with a counter-example that does not start at an initial state, we cannot conclude whether the property holds, and must increase k in search of a conclusive answer. Fig. 3 depicts a snapshot of the k-induction process used for proving a safety property.

Fig. 3: Using k-induction to prove a safety property, i.e., that the system never reaches the bad state (surrounded by a red rectangle). Although there are k-long and (k + 1)-long execution sequences that end in the bad state, there is no such sequence of length (k + 2); and due to this and to BMC on the base cases, the property holds.

More formally, following the terminology in [4], verifying ω-regular liveness properties is reducible to checking persistence properties of the form *"eventually forever* B*"*, where B represents a "bad" state ($\exists s$ s.t. $B = \neg P_G(s)$). Using k-induction in the spirit of [6], [54], we can rule out the existence of k-long sequences of bad states for a given k (even ones not starting at an initial state). This is performed by formulating the following query:

$$\exists x\_1, x\_2, \dots, x\_k. \left(\bigwedge\_{i=1}^{k-1} T(x\_i, x\_{i+1})\right) \land \left(\bigwedge\_{i=1}^k \neg P\_G(x\_i)\right).$$

for increasingly large values of k. As soon as one such query returns UNSAT, we are guaranteed that the liveness property holds. A similar encoding can be used for proving safety properties.
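The query above can be mimicked on a small explicit graph: search for k consecutive non-good states starting from any state, and increase k until no such sequence exists. The function names and the ring-shaped toy system are assumptions for illustration; in the tool, each query is dispatched to a DNN verifier.

```python
from typing import Callable, Hashable, Iterable, List, Optional

State = Hashable


def find_non_good_sequence(
    states: Iterable[State],
    successors: Callable[[State], Iterable[State]],
    is_good: Callable[[State], bool],
    k: int,
) -> Optional[List[State]]:
    """Find x_1..x_k, consecutive under T, with no good state (the SAT case)."""
    paths = [[s] for s in states if not is_good(s)]  # may start anywhere
    for _ in range(k - 1):
        paths = [
            path + [nxt]
            for path in paths
            for nxt in successors(path[-1])
            if not is_good(nxt)
        ]
    return paths[0] if paths else None


def prove_liveness(states, successors, is_good, max_k: int) -> Optional[int]:
    """Return the first k whose query is UNSAT, or None if none is found."""
    for k in range(1, max_k + 1):
        if find_non_good_sequence(states, successors, is_good, k) is None:
            return k
    return None


def ring_successors(state: int) -> List[int]:
    """Toy system (an assumption): four states arranged in a ring."""
    return [(state + 1) % 4]
```

In the ring, the longest run of non-good states is 1, 2, 3, so the query first becomes UNSAT at k = 4.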

We note that realizing k-induction in our case-studies entailed contending with challenges such as the need to encode verification queries that capture the system-environment interaction from *any* (possibly non-initial) state. An additional challenge was scalability; duplicating the network to encode k steps can induce an exponential blowup in running time. *whiRL 2.0* curtails the search space by using bound tightening mechanisms, and by enforcing certain dependencies between the inputs to the k duplicate networks encoded as part of a k-induction query. Specifically, these k inputs typically represent the k recent observations of the agent's environment, and can be restricted by requiring them to constitute a "sliding window": each pair of consecutive inputs must agree on the k − 1 previous observations that appear in both inputs.

BMC and k-induction are related techniques; the former is geared towards refuting a property, and the latter is geared towards proving it. In *whiRL 2.0*, we take a portfolio approach, as depicted in Fig. 4: we alternate between BMC and k-induction queries, until we: (i) refute the property (BMC returns SAT); or (ii) prove the property (k-induction returns UNSAT); or (iii) hit a timeout threshold. When neither the BMC query nor the k-induction query is conclusive for the current k, we increment k by 1 and repeat the process. Thus, although we do not know in advance whether the property in question holds, we hope that one of the two techniques will either find a counter-example or prove the property.

Fig. 4: *whiRL 2.0*'s verification schema.
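The alternation between the two query types can be sketched as a simple loop. The `portfolio_verify` function, its callable parameters, the `max_k` cap, and the returned labels are hypothetical and do not reflect whiRL 2.0's actual interface; the two callables stand in for black-box verifier invocations that return "SAT" or "UNSAT".

```python
import time
from typing import Callable, Tuple


def portfolio_verify(
    bmc_query: Callable[[int], str],
    induction_query: Callable[[int], str],
    max_k: int,
    timeout_seconds: float,
) -> Tuple[str, int]:
    """Alternate BMC and k-induction queries for k = 1, 2, ..., max_k."""
    deadline = time.monotonic() + timeout_seconds
    for k in range(1, max_k + 1):
        # The timeout is only checked between queries in this sketch.
        if time.monotonic() > deadline:
            return ("TIMEOUT", k)
        if bmc_query(k) == "SAT":
            return ("REFUTED", k)   # counter-example from an initial state
        if induction_query(k) == "UNSAT":
            return ("PROVED", k)    # no violating k-long sequence exists
    return ("UNKNOWN", max_k)
```

Running BMC first for each k ensures that the base cases up to k have been checked before an UNSAT k-induction result is reported as a proof.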

Abstraction. In computer networking systems, such as the Aurora congestion controller, the system's state is often a set of observations about the environment. Through close inspection of our considered case-studies, we observe that occasionally some of the input fields are irrelevant to the property being checked, in the sense that the property can be proved even when disregarding them. We thus integrate into *whiRL 2.0* *abstraction* capabilities [10], namely the ability to strip off irrelevant input fields, as indicated by the user, when dispatching a verification query. The original transition system $\mathcal{T}$ is thus changed into an abstract transition system, $\mathcal{T}'$, which overapproximates the original one. Specifically, the states of $\mathcal{T}'$ are symbolic, each corresponding to multiple states of $\mathcal{T}$; and $s'_1 \xrightarrow{T'} s'_2$ if and only if some states $s_1$ and $s_2$, to which $s'_1$ and $s'_2$ correspond, satisfy $s_1 \xrightarrow{T} s_2$. If the verification engine concludes that the property holds for $\mathcal{T}'$ (i.e., the negation of the property is UNSAT), it follows that it also holds for the original $\mathcal{T}$. However, a counter-example for $\mathcal{T}'$ may be spurious, as it may not be valid for $\mathcal{T}$, in which case the original query may need to be solved to obtain a definite result.

For example, in Aurora, the DNN input represents performance-related statistics pertaining to the t most recent rate adjustments made by the sender. In Aurora's implementation used for our evaluation, we chose t = 10 (as in [27]). In this context, abstraction might expose, for instance, that a certain property holds regardless of what values are assigned to the fields not relating to the 5 most recent rate changes, indicating that the policy is, in essence, dependent only on the 5 most recently observed statistics vectors.
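The over-approximation introduced by abstraction, and the possibility of spurious counter-examples, can be seen in a small explicit-state analogue. The projection-based construction and the helper names below are assumptions for illustration; the abstraction described above is symbolic, whereas this sketch simply drops tuple fields from enumerated states.

```python
from typing import Iterable, Sequence, Set, Tuple

State = Tuple[int, ...]
Edge = Tuple[State, State]


def project(state: State, kept_fields: Sequence[int]) -> State:
    """Keep only the fields that the user marked as relevant."""
    return tuple(state[i] for i in kept_fields)


def abstract_transitions(concrete: Iterable[Edge], kept_fields: Sequence[int]) -> Set[Edge]:
    """An abstract edge exists iff some concrete edge projects onto it."""
    return {(project(a, kept_fields), project(b, kept_fields)) for a, b in concrete}


def reachable(transitions: Iterable[Edge], start: State, target: State) -> bool:
    """Depth-first reachability over an explicit edge set."""
    edges = list(transitions)
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        for a, b in edges:
            if a == current and b not in seen:
                seen.add(b)
                stack.append(b)
    return False
```

For the concrete edges `{((0, 0), (1, 0)), ((1, 1), (2, 1))}`, dropping the second field yields an abstract path from `(0,)` to `(2,)` although `(2, 1)` is not reachable from `(0, 0)`, which is an instance of a spurious counter-example.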

We leverage the fact that inputs to recently-proposed computer networked systems consist of fairly few fields with natural semantic meaning, thus leading to a limited number of actual combinations of input fields that are abstracted.

In Section VI we demonstrate how *whiRL 2.0*'s abstraction capabilities can shed light on the inner workings of the verified system, rendering the "black-box" policy learned by the DRL system somewhat more translucent.

#### IV. INVARIANT INFERENCE

Verifying DRL systems is difficult, as one must often reason about transitions across many states to establish that a property holds. BMC and k-induction can mitigate this issue to some extent, but sometimes this is not enough. To further boost the scalability of *whiRL 2.0*, we enhanced it with semi-automated *invariant inference* capabilities.

In the context of safety verification of a transition system graph, an *invariant* can be regarded as a partition of the state space S into two disjoint sets, $S_1$ and $S_2$, such that no transition leads from the first set to the second: $s_1 \in S_1 \wedge s_2 \in S_2 \Rightarrow \langle s_1, s_2 \rangle \notin T$. Invariants are useful if we know that $I \subseteq S_1$ (all initial states are in $S_1$) and $P_B(s) \Rightarrow s \in S_2$ (all bad states are in $S_2$). In this case, the existence of the invariant immediately guarantees that no bad states are reachable. Unfortunately, discovering such useful invariants is known to be undecidable in general, and very difficult to accomplish in practice [46].
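The three conditions that make an invariant useful can be checked directly on an explicit toy graph. The function name and the example data are hypothetical; the sketch only restates the definition above in executable form.

```python
from typing import Callable, Hashable, Iterable, Tuple

State = Hashable


def is_useful_invariant(
    in_s1: Callable[[State], bool],
    transitions: Iterable[Tuple[State, State]],
    initial_states: Iterable[State],
    bad_states: Iterable[State],
) -> bool:
    """Check the partition S1 / S2, where S2 is the complement of S1."""
    # No transition may lead from S1 into S2.
    no_escape = all(not (in_s1(a) and not in_s1(b)) for a, b in transitions)
    # All initial states must lie in S1.
    covers_initial = all(in_s1(s) for s in initial_states)
    # All bad states must lie in S2.
    excludes_bad = all(not in_s1(s) for s in bad_states)
    return no_escape and covers_initial and excludes_bad
```

When all three checks pass, every execution stays inside S1, so no bad state is reachable.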

As part of *whiRL 2.0*, we propose a heuristic for semi-automated invariant inference, which leverages common traits of communication networking systems. More precisely, we observe that many relevant properties in these systems can be regarded as *Boolean monotonic functions*; they tend to be satisfiable when the DNN's input vectors are allowed to fluctuate extensively, but quickly become unsatisfiable when these input vectors are restricted. Often, the tipping point, i.e., the minimal input restrictions that cause the property to shift from SAT to UNSAT, constitutes an invariant that is useful for proving other properties, and which can also render the policy learned by the DNN more translucent to humans.

We demonstrate these notions on the Aurora congestion controller. Recall that Aurora's output indicates whether the sending rate should be increased, maintained, or decreased. *whiRL 2.0* can search for an invariant that translates to the range of inputs for which the DNN outputs that the sending rate should be decreased. Such an invariant can assist in the verification of complex properties, and provide human engineers with comprehensible insights into the DRL system.

Technically, *whiRL 2.0* allows the user to specify the output property and mark the relevant input fields. For example, in Aurora's case, the user may specify "the sending rate should be decreased" as the output property, and a subset of the input statistics as the relevant fields. *whiRL 2.0* then performs a binary search on the range of the inputs in order to find the minimal restrictions that render the verification query UNSAT. At each step of the binary search, we invoke a black-box verification procedure to solve the resulting query. This allows us to locate the tipping point up to a prescribed precision. *whiRL 2.0* has built-in *templates* for input and output restrictions, which can be regarded as different strategies for conducting the aforementioned binary search. Each template takes into account either the DRL system's input variables or output variables, and controls them by adjusting their bounds, tightening them to "push" the query towards the UNSAT region. Currently, these templates include (i) for a fixed output, tightening or loosening the bounds of the specified input variables, executing binary search until the point in which the query switches from SAT to UNSAT is discovered; and (ii) performing a similar operation, but this time on the bounds of the specified output variables, while fixing the inputs according to user-specified constants.

Fig. 5 illustrates an invariant search procedure. In this procedure, we have a candidate invariant (the middle blue line) that splits the search space into two parts. Ideally, the reachable states should all be on one side of the partition, and the bad states on the other side. Our binary search automatically adjusts the invariant candidate. In case an initial invariant candidate is too strong (there are reachable states on both sides), it is weakened, and the line is moved towards B. If, however, the initial invariant candidate is too weak (there are bad states on both sides), it is strengthened, and the line is moved towards I. Both kinds of adjustments are performed by tightening or loosening the bounds on the input or output variables.

Fig. 5: Invariant search procedure. The initial states are the green square labeled I, and the bad states are the red square labeled B.

#### V. IMPLEMENTATION

We implemented *whiRL 2.0* as a Python framework that provides general functionality for verifying DRL systems. *whiRL 2.0* uses Marabou [31], a state-of-the-art SMT-based [5], [12], [14] DNN verifier, as a backend (although other verifiers could also be used). *whiRL 2.0* includes the following key modules, which did not exist in *whiRL 1.0*:


TABLE I: *whiRL 2.0* features used in each case study.


should be abstracted. When abstraction is applied, *whiRL 2.0* will either return UNSAT (if the abstract query returns UNSAT), or default to the original query if the abstract query returns a spurious counter-example.

Additionally, *whiRL 2.0* retains some of *whiRL 1.0*'s functionality, most notably its DNN loading interfaces and bounded model checking capabilities. The code for *whiRL 2.0*, alongside documentation and the experiments described in the paper, is available online under a permissive license [1]. An appendix with the formulation of the verified properties is also available online [2].

#### VI. CASE STUDIES

We evaluate *whiRL 2.0* on three case studies of DRL systems: the *Aurora* [27] congestion controller, the *Pensieve* [43] adaptive video streamer, and the *DeepRM* [42] compute resource scheduler. All three case studies, which were used to illustrate the power of *whiRL 1.0* in [16], are from the domain of communication networks. We have identified such DRL systems as highly suitable candidates for evaluating DRL system verification techniques as they achieve state-of-the-art results despite being of moderate sizes, rendering verification tractable. Table I summarizes the *whiRL 2.0* capabilities applied in each case study. All experiments were conducted on an HP EliteDesk machine with six Intel i5-8500 cores running at 3.00 GHz, and with 32 GB of memory.

#### *A. The Aurora Congestion Controller*

*Aurora* [27] is a state-of-the-art DRL system that acts as a congestion controller for data transmission. Aurora receives an input vector of size 3t, which consists of observations from the previous t time-steps. Specifically, the input consists of 3 distinct values representing performance-related statistics for each of the previous t rate changes outputted by the DNN: (i) *latency gradient*: the derivative of latency (packet delays) across time, as measured by the sender, following a change to the rate; (ii) *latency ratio*: the ratio of the average latency experienced by the sender, following a change to the rate, to the minimum past latency experienced. This value is never smaller than 1; and (iii) *sending ratio*: the ratio of the rate at which packets are injected into the network by the sender (i.e., the sending rate), to the rate at which the sent packets arrive at the receiver. We note that the latter rate can be strictly lower than the former rate if the network is congested, which can lead to sent packets being forced to wait in in-network buffers, or being dropped along the way. The sending ratio is never smaller than 1. Intuitively, simultaneous low latency gradient, latency ratio, and sending ratio are indicative of excellent network conditions. Aurora has a single output value, which indicates whether the sending rate should be increased (positive output), decreased (negative output), or maintained (output is zero). When network conditions are good (low latency, no packet loss), this is indicative of the current rate not overshooting the network bandwidth. Hence, we expect the sending rate to increase so as to take over available bandwidth. In contrast, when network conditions are poor (high latency, high packet loss), this is indicative of network congestion, and so we expect Aurora to decrease the rate. See [16], [27] for additional details.

In line with previous work [16], [27], we set t = 10, i.e., the input to Aurora's DNN is of size 3t = 30. Aurora's DNN has a single hidden ReLU layer with 48 neurons, and a single neuron in its output layer.

Proving Liveness. In our previous work [16], two liveness properties of Aurora were formulated, but could not be verified using *whiRL 1.0*. Using *whiRL 2.0*, we successfully proved that both properties from [16] always hold. Details follow.


Semi-Automatic Invariant Inference. Next, we used *whiRL 2.0*'s invariant inference capabilities to find invariants for proving safety properties of Aurora.

• Invariant A: bounding the next-step decrease in sending rate for excellent network conditions. When Aurora observes a history of excellent network conditions (low latency, no packet loss), the DRL agent's output should be nonnegative, i.e., should not imply a decrease to the sending rate. This safety property was shown to be violated in previous work [16]. Here, we utilize *whiRL 2.0*'s invariant inference techniques to prove a bound on this (undesirable) next-step decrease in sending rate, to provide visibility into the performance of the DRL system.

*whiRL 2.0*'s method for producing the desired invariant appears in Alg. 1. The algorithm takes two user inputs: the *latency slack* ε, and the *precision* η. The ε input captures the notion of "excellent network conditions" encoded as inputs to the DNN: the observed latency gradient is restricted to the range [−ε, ε]; and the observed latency ratio is restricted to the range [1, 1 + ε]. Additionally, the sending ratio is set to 1 (indicating that sent traffic arrives at the receiver without being delayed or dropped within the network). The algorithm now performs a binary search over the DNN's output space (leaving the prescribed input ranges for the DNN fixed). Specifically, the η input specifies the desired precision: the output of the algorithm will be an upper bound b on the DNN's output, such that the output b is impossible, but b + η is possible, given the aforementioned input restrictions. Recall that the upper bound b relates to the *negation* of the desired property, and so an upper bound of b implies that Aurora's DNN will never decrease the sending rate by b *or more* when network conditions are excellent. This procedure terminates within a few seconds, returning an upper bound on the output for which the DNN verifier returns UNSAT. The algorithm's correctness immediately follows from the underlying verifier's soundness.

#### Algorithm 1 *Finding Invariant* A

Input: ε, η // *latency slack, precision*

Output: UBUNSAT // *worst-case output decrease bound*

1. UBUNSAT ← −∞ // −M*, for some large constant* M
2. UBSAT ← 0
3. QUERY ← DNN_VERIFY(ε, output ≤ 0)
4. while (|UBSAT − UBUNSAT| ≥ η) do
5. &emsp;OUTUPPER ← ½ (UBUNSAT + UBSAT)
6. &emsp;QUERY ← DNN_VERIFY(ε, output ≤ OUTUPPER)
7. &emsp;if QUERY is SAT then UBSAT ← OUTUPPER
8. &emsp;if QUERY is UNSAT then UBUNSAT ← OUTUPPER
9. return UBUNSAT
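Alg. 1 can be rendered as a short runnable sketch. The `dnn_verify` callable is a hypothetical stand-in for the black-box verifier: it is assumed to already embed the ε-dependent input restrictions and to answer "SAT" or "UNSAT" for the query "output ≤ bound". The early return when the initial query is UNSAT, and the assumption that the query is UNSAT at −M, are choices of this sketch rather than steps stated in the listing.

```python
from typing import Callable


def find_invariant_a(
    dnn_verify: Callable[[float], str],
    eta: float,
    big_m: float = 1e6,
) -> float:
    """Binary search over the output bound, following the steps of Alg. 1."""
    if eta <= 0:
        raise ValueError("the precision eta must be positive")
    ub_unsat = -big_m  # stands in for minus infinity; assumed to be UNSAT
    ub_sat = 0.0
    if dnn_verify(ub_sat) == "UNSAT":
        return ub_sat  # sketch assumption: no output <= 0 exists at all
    while abs(ub_sat - ub_unsat) >= eta:
        out_upper = 0.5 * (ub_unsat + ub_sat)
        result = dnn_verify(out_upper)
        if result == "SAT":
            ub_sat = out_upper
        elif result == "UNSAT":
            ub_unsat = out_upper
        else:
            raise RuntimeError(f"unexpected verifier answer: {result!r}")
    return ub_unsat


def stub_verifier(bound: float) -> str:
    """Toy verifier (an assumption): the lowest attainable output is -2.5."""
    return "SAT" if bound >= -2.5 else "UNSAT"
```

With the stub, the returned bound lies just below −2.5 and within η of it, matching the guarantee that b is impossible while b + η is possible.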

• Invariant B: inferring when Aurora fails to decrease the next-step sending rate even though network conditions are poor. We now wish to characterize poor network conditions in which Aurora does not decrease its sending rate, contrary to what is expected of it. The procedure is described in Alg. 2. Now, the sending ratio is not fixed to 1, but is rather within the range [1, *P*], for a user-specified *P* value. *P* represents a user-provided upper bound on the ratio of the rate at which packets leave the sender (i.e., the sending rate) to the rate at which these packets arrive at the receiver. For a slack ε, the procedure again restricts the latency gradient to the range [−ε, ε] and the latency ratio to the range [1, 1 + ε]. Intuitively, setting low values for ε while allowing sending ratios to be high corresponds to sending traffic across communication networks in which in-network buffers are very shallow. In such networks, packets cannot accumulate within the network, resulting in low latencies for packet delivery. However, since in-network buffers are shallow, packets are dropped once network bandwidth is even slightly exceeded, resulting in high sending ratios when the sending rate significantly overshoots the network's capacity (and many packets are lost).

The algorithm fixes the output's lower bound to be nonnegative, and executes a binary search on the input sending ratio. Specifically, the algorithm returns, for any user-chosen value *P*, a lower bound (LBUNSAT) such that Aurora always decreases the sending rate when its observations regarding past sending ratios all lie within the range [LBUNSAT, *P*]. *whiRL 2.0* finds the invariant within a few seconds.

#### Algorithm 2 *Finding Invariant* B

Input: *P* ≥ 2 // *upper bound on the sending ratio*

Output: LBUNSAT // *worst-case sending ratio bound*

1. LBSAT, SRLOWER ← 1
2. LBUNSAT, SRUPPER ← *P*
3. QUERY ← DNN_VERIFY(ε, output ≥ 0, SRLOWER, SRUPPER)
4. while (LBSAT + 1 < LBUNSAT) do
5. &emsp;SRLOWER ← ½ (LBSAT + LBUNSAT)
6. &emsp;QUERY ← DNN_VERIFY(ε, output ≥ 0, SRLOWER, SRUPPER)
7. &emsp;if QUERY is SAT then LBSAT ← SRLOWER
8. &emsp;if QUERY is UNSAT then LBUNSAT ← SRLOWER
9. return LBUNSAT
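A runnable sketch of Alg. 2 follows. As before, `dnn_verify` is a hypothetical stand-in for the black-box verifier: it is assumed to take the sending-ratio range [SRLOWER, SRUPPER], to embed the ε-dependent latency restrictions and the "output ≥ 0" condition, and to answer "SAT" or "UNSAT". Returning 1 when the initial query is already UNSAT is a choice of this sketch.

```python
from typing import Callable


def find_invariant_b(dnn_verify: Callable[[float, float], str], p: float) -> float:
    """Binary search on the sending ratio's lower bound, following Alg. 2."""
    if p < 2:
        raise ValueError("P must be at least 2")
    lb_sat, lb_unsat = 1.0, float(p)
    sr_upper = float(p)
    if dnn_verify(lb_sat, sr_upper) == "UNSAT":
        return lb_sat  # sketch assumption: the rate always decreases on [1, P]
    while lb_sat + 1 < lb_unsat:
        sr_lower = 0.5 * (lb_sat + lb_unsat)
        result = dnn_verify(sr_lower, sr_upper)
        if result == "SAT":
            lb_sat = sr_lower
        elif result == "UNSAT":
            lb_unsat = sr_lower
        else:
            raise RuntimeError(f"unexpected verifier answer: {result!r}")
    return lb_unsat


def stub_verifier(sr_lower: float, sr_upper: float) -> str:
    """Toy verifier (an assumption): a violation exists iff sr_lower <= 6.3."""
    return "SAT" if sr_lower <= 6.3 else "UNSAT"
```

The loop condition stops the search once the SAT and UNSAT bounds are at most 1 apart, so the precision of the returned bound is fixed by the listing rather than supplied by the user.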

Observing the bounds produced by Alg. 2 yielded surprising insights regarding the decision-making policy learned by Aurora. Specifically, to gain insight into what our discovered invariants reveal regarding the policies, we created multiple instances of Aurora agents, and trained them all on the same training data until achieving an averaged reward value similar to that of the original Aurora controller [27]. We then observed that for some of the Aurora instances, the discovered invariants depended only on the *proportion* between the sending ratio's lower bound (SRLOWER) and upper bound (SRUPPER), as opposed to their *absolute* values. Specifically, for violating counter-examples (inputs to Aurora's DNN) produced for these instances, the ratio between the highest and lowest past sending ratios was at least 2, with lower ratios giving rise to desirable behavior by Aurora. For other trained instances of Aurora, violating counter-examples only depended on the absolute values of the bounds; e.g., Aurora always decreases the rate for inputs to the DNN where all sending ratios lie in the range [1, M] for some value M, but not when these lie in the range [1, M +δ] for some small δ. Our findings show that policies that yield the same expected reward on the training set might *generalize* very differently to inputs that lie outside this training set, and that our discovered invariants can shed light on the generalization strategies of different policies learned.

#### *B. The Pensieve Video Streamer*

*Pensieve* is a DRL system [43] for *adaptive bitrate* (ABR) selection. To provide high quality of experience for video clients, Pensieve continuously collects statistics about the client's experience when downloading video chunks (e.g., was the video rebuffered? how long did it take to download the chunk?) to dynamically adapt the resolution at which the next video chunk is downloaded from the video server. Each video chunk represents a fixed-duration video segment (e.g., 4-second-long chunks in our experiments) encoded in one of several possible resolutions (SD, HD, etc.), with higher resolutions corresponding to larger chunks, in terms of number of bits. When client-sensed network conditions are good, we expect the ABR algorithm to decide that the next video chunk will be downloaded in high resolution (HD); and when they are poor, we expect a low resolution (SD) to be selected, to avoid having the client not finish the download in time, which leads to video rebuffering. The input to Pensieve's DNN consists of (2t + M + 3) fields, where t > 0 represents the number of recent video chunk downloads considered, and M > 0 represents the number of available video resolutions. The input comprises: (i) the *bitrate* (1 field) in which the last video chunk was downloaded; (ii) the current *video buffer size* (1 field) of the client, reflecting the number of seconds of unwatched video stored at the client; (iii) network *throughput measurements* for video chunks downloaded in the past t time-steps (t fields); (iv) *download times* for the video chunks downloaded in the past t time-steps (t fields); (v) *resolution options* (M fields) to download the next chunk; and (vi) the number of *remaining chunks* to be downloaded (1 field). See [43] for a thorough exposition of Pensieve, and [16] for a formalism of the Pensieve verification challenge.
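The input layout listed above can be tabulated with a few lines of code, which also confirms the total of (2t + M + 3) fields. The field names and their ordering are assumptions that follow the enumeration in the text; the actual Pensieve implementation may arrange its inputs differently.

```python
from typing import Dict, Tuple


def pensieve_input_layout(t: int, m: int) -> Tuple[Dict[str, Tuple[int, int]], int]:
    """Map each input group to a half-open index range, and return the total size."""
    groups = [
        ("last_bitrate", 1),
        ("buffer_size", 1),
        ("throughput", t),
        ("download_time", t),
        ("resolution_options", m),
        ("remaining_chunks", 1),
    ]
    offsets: Dict[str, Tuple[int, int]] = {}
    start = 0
    for name, width in groups:
        offsets[name] = (start, start + width)
        start += width
    return offsets, start
```

For t = 8 and M = 6, the values used in the experiments, the total is 2·8 + 6 + 3 = 25 fields.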

To maintain consistency with Pensieve's original hyperparameters, in our experiments t = 8 and M = 6. Due to the nature of an ABR algorithm, all executions are finite (downloads finish in finite time), and so all relevant properties are safety properties. In previous work [16], *whiRL 1.0* was applied to check two safety properties of Pensieve:


While Property 1 was shown not to hold [16], no counterexamples could previously be found for Property 2, and so it could neither be proved nor disproved using existing tools.

Using *whiRL 2.0*, we were able to prove that Property 2 indeed holds under certain, realistic, assumptions.<sup>1</sup> To achieve this, we applied k-induction, with k = 1. The result returned by the verifier indicated that the bad states are unreachable, and, hence, that the undesirable behavior cannot occur. These verification queries took approximately 20 minutes to solve.

#### *C. The DeepRM Resource Manager*

*DeepRM* [42] is a DRL-based resource manager, responsible for allocating various cluster compute resources (e.g., CPU, memory) to queued jobs, in order to optimize the cluster's throughput. DeepRM receives the following as input: (i) the *current resource usage* in the system; (ii) a *queue* with up to Q pending jobs waiting to be scheduled; and (iii) a *backlog*, indicating the number of jobs waiting to be scheduled that are not yet in the queue. For a fixed Q-sized job queue, the DeepRM controller may output one of (Q+1) possible actions: a *wait* action (i.e., no resources will be allocated at this timestep), or a *schedule*<sup>q</sup> action for 1 ≤ q ≤ Q, indicating that job q should be scheduled next. DeepRM's output is interpreted as a probability distribution, assigning a certain probability to each of the (Q + 1) possible actions. We refer the reader to [42] for a thorough exposition of DeepRM, and to [16] for a formalism of the DeepRM verification challenge.

<sup>1</sup>We assumed that chunks represent 4-second-long video segments. Considered chunk download times are between 4 and 15 seconds per chunk, which implies that downloading each chunk takes longer than consuming it.

In our case study, as in [16], we used a DeepRM system trained with R = 2 resources: *CPU* and *memory units*, and a job queue of size Q = 5. Overall system resources consist of 10 CPUs and 10 memory units. We considered two kinds of jobs: *small* jobs, which require 1 CPU and 1 memory unit for a single time-step, and *large* jobs, which require 10 CPUs and 10 memory units, for t = 20 time-steps.

Previous work [16] considered the following safety properties for DeepRM:


Using *whiRL 1.0*, it was shown [16] that Property 1 holds, and that there exist counter-examples for Properties 2 and 3. However, by using *whiRL 2.0* we were able to prove (within a few seconds) a stronger property that, in fact, generalizes properties 1, 2 and 3. By applying *whiRL 2.0*'s abstraction capabilities to both the inputs indicating resource utilization and the output indicating the recommended action, we proved that for *any* resource utilization level, when the queue is filled with identical jobs, the DRL system's output assigns a higher probability to *schedule*<sup>2</sup> than to *wait*. This immediately proves Property 1, and implies that Properties 2 and 3 cannot hold.

This finding sheds new light on previous results, and enhances our understanding of DeepRM: (i) the three original properties do not depend on the current resource utilization. Rather, due to the DRL system learning a suboptimal policy, it is biased towards scheduling a specific job (job #2), and may fail to select *wait* when appropriate; and (ii) the counterexamples found for Properties 2 and 3 are not outliers, but rather the general case. Indeed, we were able to use *whiRL 2.0* to prove that the inverses of both these properties always hold. These results demonstrate that, beyond proving or disproving specific properties, *whiRL 2.0* can shed light on the policy learned by the DRL system, and expose problematic issues.

#### VII. RELATED WORK

Due to the increasing use of DNNs, many DNN verification tools have been proposed in recent years; some are SMT-based (e.g., [28], [31], [35], [47]), whereas others use different verification strategies, such as *abstract interpretation* [48], [56], [59], *mixed integer linear programming* (MILP) [52], and many others. Recently, these approaches were extended to verify systems with multi-step executions, such as Recurrent Neural Networks (RNNs) [26], [58] or hybrid systems [50].

In our evaluation of *whiRL 2.0*, we used *Marabou* [31], [57] as a black-box DNN verifier. To date, Marabou has mostly been applied for solving adversarial robustness queries [3], [8], [24], [29], and our work demonstrates that it is also applicable in the field of computer and networked systems. Marabou affords additional features, such as built-in abstraction [15], simplification [20], [36], repair [21] and optimization [49] techniques, which could also be applied to our case studies.

In addition to general DNN verification engines, methods have been devised to formally verify safety properties of DRL systems, which are the subject matter of this work. Such approaches include *shield synthesis* [33], and combining the verification process with *verified runtime monitoring* [18]. Other methods focus on finding adversarial attacks that pertain specifically to DRL agents, e.g., by using MILP [13].

In addition to the *whiRL* project, other approaches have been proposed for verifying DRL systems in the domain of communication networks. These include, e.g., *Verily* [32] and *Metis* [44]. Importantly, however, our focus is on verifying (as opposed to only refuting) various safety and liveness properties of these systems. To the best of our knowledge, this lies beyond the grasp of other existing tools.

#### VIII. CONCLUSION

DRL systems provide excellent performance in multiple settings, but suffer from severe vulnerabilities. Several verification tools have been developed to mitigate this concern, but these mostly refute, as opposed to prove, safety and liveness properties of interest. In this work, we presented *whiRL 2.0* — a novel verification engine that supports proving both safety and liveness properties of DRL systems. *whiRL 2.0* accomplishes this through semi-automatic invariance inference, alongside techniques such as k-induction and query abstraction. We demonstrated our tool's capabilities through three case studies from the communication networks domain. In addition, we demonstrated how *whiRL 2.0* can provide insights into the inner workings of these systems, uncovering weaknesses that would otherwise remain unnoticed.

In the future, we plan to enhance our tool's scalability by using improved search heuristics. Also, we intend to enrich the semi-automatic invariant inference templates to support searching for more complex invariants.

Acknowledgements. We thank Nathan Jay, Tomer Eliyahu and the anonymous reviewers for their contributions to this project. The project was partially supported by the Israel Science Foundation (grant number 683/18), the Binational Science Foundation (grant numbers 2017662 and 2019798), and the Center for Interdisciplinary Data Science Research at The Hebrew University of Jerusalem.

#### REFERENCES


*16th Int. Symposium on Automated Technology for Verification and Analysis (ATVA)*, pages 3–19, 2018.


# Exploiting Isomorphic Subgraphs in SAT

Alexander Ivrii IBM Haifa Research Lab, Israel alexi@il.ibm.com

Ofer Strichman Information System Engineering, IE, Technion, Haifa, Israel ofers@ie.technion.ac.il

*Abstract*—While static symmetry breaking has been explored in the SAT community for decades, only since 2010 has research focused on exploiting the same discovered symmetry dynamically, during the run of the SAT solver, by learning extra clauses. The two methods are distinct and not compatible. The former may prune solutions, whereas the latter does not – it only prunes areas of the search that are guaranteed not to have solutions, like standard conflict clauses. Both approaches, however, require what we call *full symmetry*, namely a propositionally-consistent mapping σ between the literals, such that σ(φ) ≡ φ, where ≡ means syntactic equivalence modulo clause ordering and literal ordering within the clauses. In this article we show that such full symmetry is not a necessary condition for adding extra clauses: isomorphism between possibly-overlapping subgraphs of the colored incidence graph is sufficient. While finding such subgraphs is a computationally hard problem, there are many cases in which they can be detected a priori by analyzing the high-level structure of the problem from which the CNF was derived. We demonstrate this principle with several well-known problems.

# I. INTRODUCTION: SYMMETRY, ALMOST SYMMETRY, AND E-CLAUSES

Symmetry breaking [22] is a well-known technique for accelerating SAT solving, introduced decades ago by Puget [21] for CSP, and later by Crawford et al. [8] for CNF. Symmetry breaking for CNF was implemented efficiently in the tool SHATTER [4] and later improved in BREAKID [11]. In a nutshell, new predicates, called *symmetry-breaking* predicates, are added to the input formula φ. These predicates prune the search space and are likely to remove solutions, but they do not change the satisfiability of the formula. The construction of those predicates is based on finding a mapping σ between the literals of the input formula φ, such that σ(φ) ≡ φ. Here '≡' means syntactic equivalence modulo clause ordering and literal ordering within the clauses. The mapping has to be *propositionally-consistent*, which means that ∀v<sub>1</sub>, v<sub>2</sub> ∈ var(φ). σ(v<sub>1</sub>) = v<sub>2</sub> ⇒ σ(v̄<sub>1</sub>) = v̄<sub>2</sub> and σ(v<sub>1</sub>) = v̄<sub>2</sub> ⇒ σ(v̄<sub>1</sub>) = v<sub>2</sub>. If we find such a mapping, then every satisfying solution α to φ has the property that σ(α) also satisfies φ. We can then add a constraint that prunes one of those solutions. As an example, consider

$$
\varphi = (1\ {-3})\,(2\ {-3})\,(1\ 2\ 3)\,({-1}\ {-2})
$$

and the mapping σ : 1 ↦ 2, 2 ↦ 1 (by convention, each such mapping implies that the mapping of the negated literals is also included in σ, e.g., −1 ↦ −2 ∈ σ). We see that

$$
\sigma(\varphi) = (2\ {-3})\,(1\ {-3})\,(2\ 1\ 3)\,({-2}\ {-1})\,,
$$

and that σ(φ) ≡ φ. Indeed, if we take any solution α to φ, then σ(α) is a solution as well. For example, for α = (1, 2, 3) ↦ (T, F, F) we have α |= φ, and σ(α) |= φ as well, since σ(α) = (1, 2, 3) ↦ (F, T, F). Crawford et al. showed how to add symmetry-breaking constraints, which we will not detail here. In this case it may amount to adding the clause (-1 2), which excludes the first solution without excluding the second one. Such pruning of solutions is in many cases helpful for shortening the overall run-time [4], [17].
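The claims of this small example can be checked mechanically. The following Python sketch is only an illustration: the helper names (`map_lit`, `map_cnf`, `satisfies`, `map_assignment`) are ours and do not belong to any tool mentioned here, and the third clause of φ is taken to be (1 2 3), which is what the displayed σ(φ) requires. The sketch also assumes that σ maps variables to variables, as it does in this example.

```python
# CNF as a set of clauses; each clause is a frozenset of signed integers.
phi = {frozenset(c) for c in [(1, -3), (2, -3), (1, 2, 3), (-1, -2)]}

# sigma swaps variables 1 and 2; negated literals follow by convention.
sigma = {1: 2, 2: 1, 3: 3}

def map_lit(lit, sigma):
    """Apply a propositionally-consistent map to a signed literal."""
    image = sigma[abs(lit)]
    return image if lit > 0 else -image

def map_cnf(cnf, sigma):
    return {frozenset(map_lit(l, sigma) for l in clause) for clause in cnf}

def satisfies(assignment, cnf):
    """assignment maps each variable to True or False."""
    return all(any(assignment[abs(l)] == (l > 0) for l in clause) for clause in cnf)

def map_assignment(assignment, sigma):
    """sigma(alpha): variable sigma(v) receives the value alpha gave to v."""
    return {sigma[v]: value for v, value in assignment.items()}

is_symmetry = map_cnf(phi, sigma) == phi
alpha = {1: True, 2: False, 3: False}
sigma_alpha = map_assignment(alpha, sigma)
breaker = {frozenset({-1, 2})}   # the symmetry-breaking clause (-1 2)
```

Under these assumptions `is_symmetry` holds, both `alpha` and `sigma_alpha` satisfy `phi`, and only `sigma_alpha` survives the added clause.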

Symmetry-breaking tools discover such mappings by analyzing the colored literals incidence graph<sup>1</sup> G with respect to multiple potential mappings Σ: if for σ ∈ Σ it holds that σ(G) ≡ G (this is called 'automorphism'), then σ defines a symmetry. The isomorphism in this case is restricted such that for every two nodes n<sub>1</sub>, n<sub>2</sub> ∈ G, if σ(n<sub>1</sub>) = n<sub>2</sub> then n<sub>1</sub> and n<sub>2</sub> must have the same color, i.e., clause nodes are mapped to clause nodes and literal nodes to literal nodes.

Another way to exploit symmetry is by adding clauses during search. Henceforth we will call such clauses 'e-clauses', for 'Extra' clauses. This option has mostly been researched in the CSP community, under the names *Symmetry breaking during search - SBDS* [5], [14], [15], [7] and *Symmetry Breaking by Dominance Detection - SBDD* [13]. In the SAT community this route was first explored via the Symmetrical Learning Scheme (SLS) [6], which adds new clauses during the search based on learned clauses and a pre-computed set of symmetry 'generators'. SLS was later improved by Symmetry Propagation (SP) [9], which only adds such extra clauses if they lead to further (immediate) propagations, and several years later by Symmetric Explanation Learning (SEL) [10], which is integrated within BCP (it takes the reason clause of the propagation as the base for adding e-clauses). According to [10], SEL is the only one of those that is competitive with modern static symmetry breaking. Finally, [25] has a similar scheme in which e-clauses are only added if the learned clause has a low LBD. In [10] those methods were jointly called *dynamic* symmetry *handling*, to emphasize that unlike *static* symmetry *breaking* they are based on an analysis during the search (hence 'dynamic'), and that they do *not break symmetry*, as they do not remove solutions. We find this name inadequate, however, because symmetry does not need to be 'handled'. A more fitting name is dynamic symmetry *exploitation*, which is the name we will use in the rest of this article. Although static symmetry breaking and dynamic symmetry exploitation are based on the same data – the symmetries in the formula – they are not compatible. One cannot use dynamic symmetry exploitation if the symmetries it relies on are broken by added predicates.

<sup>1</sup>Such a graph is constructed from a CNF by introducing a vertex for each literal and each clause, connecting opposite literals with an edge, and connecting the literals to the clauses that they are part of. The clauses' nodes have one color, and the literals' nodes have a different color.

Dynamic symmetry exploitation was also studied for the case of *almost symmetric* formulas (also called 'weak symmetry') [19], [7], formalized as follows. Let

$$
\varphi \equiv \varphi_1 \cup \varphi_2 \; , \tag{1}
$$

where here we equate formulas φ, φ<sub>1</sub>, φ<sub>2</sub> with sets of clauses. Let σ be a literal map of φ such that

$$
\sigma(\varphi_2) \equiv \varphi_2 \,. \tag{2}
$$

This reflects a common scenario, where a few clauses – marked here by φ<sub>1</sub> – disrupt the symmetries in the formula. The main method that was suggested in these references is to add e-clauses based on φ<sub>2</sub>. That is, once a clause c is learned from φ<sub>2</sub> alone, add σ(c) as well.

In this article we observe that the requirement of symmetry, as used by all of those prior works on dynamic symmetry exploitation, is a sufficient, yet not a *necessary*, condition for adding e-clauses. We will need the following definitions for explaining this claim.

*Definition 1 (The* refined *colored incidence graph):* The *refined* version of a colored incidence graph assigns separate colors to clauses of different arity.

We will denote this graph by G, assuming the underlying formula is clear from the context (it can also include learned clauses).
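As an illustration of the construction described in the footnote on the incidence graph, combined with Definition 1, the following Python sketch builds the refined colored incidence graph of a small CNF. The node naming scheme (`("lit", l)` and `("cls", i)`) and the function name are our own choices for this sketch, not something prescribed by the tools cited above.

```python
def refined_incidence_graph(cnf):
    """Return (colors, edges) for the refined colored incidence graph.

    Nodes are ("lit", l) for both literals of every variable in cnf and
    ("cls", i) for the i-th clause. All literal nodes share one color;
    a clause node is colored by the arity of its clause.
    """
    clauses = [tuple(c) for c in cnf]
    variables = sorted({abs(l) for c in clauses for l in c})
    colors, edges = {}, set()
    for v in variables:
        colors[("lit", v)] = "literal"
        colors[("lit", -v)] = "literal"
        # Opposite literals are connected by an edge.
        edges.add(frozenset({("lit", v), ("lit", -v)}))
    for i, clause in enumerate(clauses):
        # Refinement: the color of a clause node depends on its arity.
        colors[("cls", i)] = ("clause", len(clause))
        for l in clause:
            edges.add(frozenset({("cls", i), ("lit", l)}))
    return colors, edges

colors, edges = refined_incidence_graph([(1, -3), (2, -3), (1, 2, 3), (-1, -2)])
```

In the unrefined graph all clause nodes would simply share a single color; the refinement only changes the `("clause", arity)` labels.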

*Definition 2 (The subgraph induced by a resolution sequence):* Given a resolution sequence c<sub>1</sub>, . . . , c<sub>n</sub>, its corresponding *induced* subgraph in G is comprised of the subgraphs induced by these clauses, and the edges between opposite literals that were resolved in the sequence.

Now, consider such a resolution sequence c<sub>1</sub>, . . . , c<sub>n</sub> that was used for learning a clause c (c itself is not part of the sequence), and its corresponding induced subgraph g. Consider also another subgraph g′ of G that is color-isomorphic to g. It is not hard to see that g′ reflects another possible resolution sequence in the formula, ending with a different clause, which we can add as an e-clause. This criterion is *ad-hoc* and does not require automorphism of the original formula or some pre-defined part of it as in almost-symmetries. In fact, it can be seen as an application of the SR-II inference rule suggested by Krishnamurthy [18] already in 1985 (there was no indication in [18], however, of how a solver may exploit that rule). In some types of formulas, finding e-clauses based on this reasoning is computationally cheap, and can lead to improvements in the overall run-time of the solver. The important point is that this technique can be applied even when there is no mapping σ such that σ(φ) ≡ φ, which implies that this technique can derive e-clauses that cannot be derived by the above-mentioned symmetry exploitation techniques.

In fact, this idea was implicitly used in the past by the second author [24] for adding e-clauses in the case of bounded model checking problems, and by Say et al. for adding such clauses in the case of optimizing a planning process with neural networks [23]. Both references reported performance gains. In this article we give a general view that also encompasses these two references, and show that the potential for such clauses is present in many other types of formulas.

*Example 1:* Let φ be comprised of the following clauses:

$$\begin{array}{cccc}(1\ 2\ 3) & (-1\ -2\ -3) & (2\ 3\ 4) & (-2\ -3\ -4) \\ (3\ 4\ 5) & (-3\ -4\ -5) & (4\ 5\ 6) & (-4\ -5\ -6) \\ (5\ 6\ 7) & (-5\ -6\ -7) & (1\ 3\ 5) & (-1\ -3\ -5) \\ (2\ 4\ 6) & (-2\ -4\ -6) & (3\ 5\ 7) & (-3\ -5\ -7) \\ (1\ 4\ 7) & (-1\ -4\ -7) & & \end{array} \tag{3}$$

It happens to be the Van der Waerden formula (3,3; 7). We will describe this type of formula later, in Sec. III-A.

Symmetry breaking, as emitted by BREAKID, discovers the two mappings below (these are also called 'generators'). To get to the full set of possible mappings one needs to also consider their compositions.

$$
\begin{array}{ll}
\sigma_1\colon & [\,1\ 7\,]\ [\,2\ 6\,]\ [\,3\ 5\,] \\
\sigma_2\colon & [\,1\ {-1}\,]\ [\,2\ {-2}\,]\ [\,3\ {-3}\,]\ \dots\ [\,7\ {-7}\,]
\end{array} \tag{4}
$$

This representation is called 'cycle form', and should be interpreted as follows: in each line, every literal appears at most once; it should be replaced with the literal that comes next in the brackets, and if it is the last one then with the first literal in the brackets. In this example σ<sub>1</sub> implies that simultaneously swapping literals 1 and 7, 2 and 6, 3 and 5 (and correspondingly, their negated versions, -1 and -7, etc.) results in the same formula. Readers familiar with Van der Waerden formulas may notice that this symmetry corresponds to a reversal of the indices, i.e., the first variable becomes last, the second one becomes second to last, etc., and that σ<sub>2</sub> corresponds to a swap of the colors. In such formulas, regardless of their length, these are the only two possible symmetries.

Now suppose that we learn a new conflict clause c = (1 2 -5 6), via the following resolution sequence:

$$(1\ 2\ 3),\ (-3\ -4\ -5),\ (2\ 4\ 6)\ . \tag{5}$$

We can therefore add two e-clauses corresponding to the two generators:

$$
\sigma_1(1\ 2\ {-5}\ 6) = (7\ 6\ {-3}\ 2) \qquad \sigma_2(1\ 2\ {-5}\ 6) = ({-1}\ {-2}\ 5\ {-6}) \tag{6}
$$
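The derivation of c and of its two symmetric images can be replayed in a few lines. The sketch below is illustrative only; the function names `resolve`, `resolve_sequence` and `apply_map` are ours, and the generators are written out as explicit variable maps rather than in cycle form.

```python
def resolve(c1, c2):
    """Resolve two clauses on their unique pair of opposite literals."""
    pivots = [l for l in c1 if -l in c2]
    assert len(pivots) == 1, "expected exactly one clashing literal"
    p = pivots[0]
    return frozenset((set(c1) - {p}) | (set(c2) - {-p}))

def resolve_sequence(clauses):
    """Left-to-right resolution over a sequence of clauses."""
    result = frozenset(clauses[0])
    for clause in clauses[1:]:
        result = resolve(result, frozenset(clause))
    return result

def apply_map(clause, var_map):
    """Apply a variable-to-literal map; negated literals are mapped consistently."""
    return frozenset(var_map[abs(l)] if l > 0 else -var_map[abs(l)] for l in clause)

sigma1 = {1: 7, 2: 6, 3: 5, 4: 4, 5: 3, 6: 2, 7: 1}   # index reversal
sigma2 = {v: -v for v in range(1, 8)}                  # color swap

c = resolve_sequence([(1, 2, 3), (-3, -4, -5), (2, 4, 6)])
e1 = apply_map(c, sigma1)
e2 = apply_map(c, sigma2)
```

Here `c` is the conflict clause (1 2 -5 6), and `e1`, `e2` are the two e-clauses obtained from the generators.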

However, more e-clauses can be derived based on this conflict clause. We need to find a subgraph of G that is color-isomorphic to the one representing the sequence (5). Going back to our example, it is indeed not hard to see that (2 3 4), (-4 -5 -6), (3 5 7), all of which are clauses in φ, give us just that – see Fig. 1. Applying the same resolution steps yields a new e-clause (2 3 -6 7), which cannot be deduced by any composition of σ<sub>1</sub>, σ<sub>2</sub>, simply because our inference is not based on the original CNF's symmetry; rather, it is inferred dynamically from the resolution process.

Fig. 1. Two isomorphic subgraphs of the same refined colored incidence graph corresponding to (3). The literals are nodes with a different color from the clause nodes. All the clause nodes in this example are of the same arity, hence they have the same color.
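To make the last claim concrete, the following sketch (our own illustration, with hypothetical helper names) replays the resolution on the second clause sequence and then enumerates every image of c under compositions of the two generators. The e-clause (2 3 -6 7) does not appear among them.

```python
def resolve_sequence(clauses):
    """Left-to-right resolution; each step must clash on exactly one literal."""
    result = set(clauses[0])
    for clause in clauses[1:]:
        pivots = [l for l in result if -l in clause]
        assert len(pivots) == 1
        result = (result - {pivots[0]}) | (set(clause) - {-pivots[0]})
    return frozenset(result)

def apply_map(clause, var_map):
    return frozenset(var_map[abs(l)] if l > 0 else -var_map[abs(l)] for l in clause)

sigma1 = {v: 8 - v for v in range(1, 8)}   # reversal generator
sigma2 = {v: -v for v in range(1, 8)}      # color-swap generator

def orbit(clause, generators):
    """All images of clause under compositions of the generators."""
    seen, frontier = {clause}, [clause]
    while frontier:
        current = frontier.pop()
        for g in generators:
            image = apply_map(current, g)
            if image not in seen:
                seen.add(image)
                frontier.append(image)
    return seen

c = resolve_sequence([(1, 2, 3), (-3, -4, -5), (2, 4, 6)])
e = resolve_sequence([(2, 3, 4), (-4, -5, -6), (3, 5, 7)])
images_of_c = orbit(c, [sigma1, sigma2])
reachable_by_symmetry = e in images_of_c
```

Because both generators are involutions, repeatedly applying them is enough to enumerate the whole orbit of c; it contains four clauses, none of which is `e`.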

Since the subgraph isomorphism problem is NP-complete, we only focus on cases in which it can be indirectly inferred from analyzing the high-level structure of the original problem and controlling (or knowing) how it is encoded. Specifically, in such problems we derive a mapping between the literals, and adapt the solver to use this information in order to derive new e-clauses. Our implementation of this technique shows an average overall improvement in run-time.

To summarize, our contributions in this article are:


Although any paper that mentions an open mathematical problem such as Van der Waerden numbers raises the expectation that it was able to solve it (i.e., find a new Van der Waerden number), this is not a result that can be found here: we only use it as one of several examples of problem domains in which the high-level structure can be used for improving run-time.

We continue in the next section by describing the method in detail. In Sec. III we will demonstrate how to apply it with several famous problems.

#### II. FINDING ADDITIONAL E-CLAUSES

Let us recap. *Symmetry* over φ is a propositionally-consistent map σ : lits(φ) ↦ lits(φ) such that σ(φ) ≡ φ. In this situation we can add symmetry-breaking constraints, and also use dynamic symmetry exploitation by adding e-clauses, but not both.

*Almost symmetries* refer to a situation where we have a formula φ ≡ φ<sub>1</sub> ∪ φ<sub>2</sub> and a propositionally-consistent map σ : lits(φ<sub>2</sub>) ↦ lits(φ<sub>2</sub>) such that σ(φ<sub>2</sub>) ≡ φ<sub>2</sub>. Here we *cannot* add symmetry-breaking constraints because of the φ<sub>1</sub> clauses, but we can still use dynamic symmetry exploitation by adding e-clauses that are based on φ<sub>2</sub>.

We now generalize almost symmetries as follows. Let

$$
\varphi \equiv \varphi_1 \cup \varphi_2 \cup \varphi_3 \; , \tag{7}
$$

where φ, φ<sub>1</sub>, . . . are sets of clauses, possibly overlapping. Let σ : lits(φ<sub>2</sub>) ↦ lits(φ<sub>3</sub>) be a literal map such that

$$
\sigma(\varphi_2) \equiv \varphi_3 \ . \tag{8}
$$

Our central claim is:

*Proposition 1:* Let c be a conflict clause that was learned from φ<sub>2</sub>'s clauses, i.e., φ<sub>2</sub> |= c. Then φ and φ ∪ σ(c) have the same solutions.

*Proof:* Consider the resolution process by which c was inferred from φ<sub>2</sub>. The same resolution process can be applied to σ(φ<sub>2</sub>), and the result will be σ(c). Hence σ(φ<sub>2</sub>) |= σ(c), and because of (8) we have φ<sub>3</sub> |= σ(c). Since φ<sub>3</sub> ⊆ φ, it follows that φ |= σ(c), and we can add the e-clause σ(c) to φ without removing solutions.
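Proposition 1 can be sanity-checked by brute force on the formula of Example 1. In the sketch below (an illustration under our own naming; `models` and `shift` are hypothetical helpers) φ<sub>2</sub> is the resolution sequence (5), σ shifts every variable index by one, and φ<sub>3</sub> = σ(φ<sub>2</sub>). The check enumerates all 2<sup>7</sup> assignments, so it is only meant for tiny instances.

```python
from itertools import product

def models(cnf, num_vars):
    """All satisfying assignments of cnf over variables 1..num_vars."""
    result = set()
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause) for clause in cnf):
            result.add(bits)
    return result

def shift(clause):
    """sigma: move every literal one step away from zero."""
    return tuple(l + 1 if l > 0 else l - 1 for l in clause)

# Van der Waerden formula (3,3;7): all 3-term progressions within 1..7,
# once with positive and once with negative literals.
progressions = [(i, i + d, i + 2 * d) for d in (1, 2, 3) for i in range(1, 8 - 2 * d)]
phi = list(progressions) + [tuple(-x for x in p) for p in progressions]

phi2 = [(1, 2, 3), (-3, -4, -5), (2, 4, 6)]   # antecedents of c
phi3 = [shift(cl) for cl in phi2]             # sigma(phi2)
c = (1, 2, -5, 6)
e_clause = shift(c)                           # sigma(c)

phi3_in_phi = all(cl in phi for cl in phi3)
same_solutions = models(phi, 7) == models(phi + [e_clause], 7)
```

With these definitions both `phi3_in_phi` and `same_solutions` hold, as the proposition predicts.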

The following table summarizes the discussion so far.


For a given formula φ, the question is how to define φ<sub>2</sub>, φ<sub>3</sub> and the corresponding mapping σ that satisfy (8). As we will see in the next section, for certain types of formulas it can be done in such a way that e-clauses can be added in linear time. In fact it can be done in multiple ways, i.e., many such mappings exist, and we can use all of them.

#### III. EXAMPLES

We will show here two example problems that received attention in the SAT community in recent years, and in which e-clauses can be added efficiently: Van der Waerden numbers, and Boolean Pythagorean triples. The long version of this article [1] includes additional examples: Bounded model checking, SAT-based Planning, a combinatorial problem called 'Sweep', and the anti-bandwidth problem.

#### *A. Van der Waerden numbers (2 colors)*

We begin with the following definition:

*Definition 3:* The Van der Waerden number W(j, k) is the smallest integer n such that every 2-coloring of 1..n has a monochromatic arithmetic progression of length j of color 1, or of length k of color 2.

For example, the following coloring proves that W(3, 3) > 8, since there is no arithmetic progression of size 3 of either color:

.

However, there is no such coloring for n = 9, hence W(3, 3) = 9.

There is relatively little symmetry in such formulas. An obvious one is the symmetry between the colors, when j = k. Another type of symmetry is reversal (reading the sequence from the end). Reconsidering Example 1, σ<sub>1</sub>, σ<sub>2</sub> of (4) break these two symmetries.

Given j, k and n, encoding the decision problem whether W(j, k) > n with CNF is simple. Define n variables x<sub>i</sub> for 1 ≤ i ≤ n, indicating whether location i is assigned the color '1'. The constraints on the arithmetic progression are given by

$$\{(\bar{x}_i \lor \bar{x}_{i+d} \lor \cdots \lor \bar{x}_{i+(j-1)d}) \mid i \in [1, n-(j-1)d],\ d \ge 1\} \;\cup\; \{(x_i \lor x_{i+d} \lor \cdots \lor x_{i+(k-1)d}) \mid i \in [1, n-(k-1)d],\ d \ge 1\}, \tag{9}$$

as was described, e.g., by Knuth in [17]. From here on we will use integers as representatives of literals.
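A generator for this encoding is short. The sketch below is ours and makes one assumption explicit: following the convention in the text that a true variable means color '1', progressions of length j in color '1' are forbidden by all-negative clauses and progressions of length k in the other color by all-positive clauses. The exhaustive `satisfiable` check is a hypothetical stand-in for a SAT solver and is only usable for very small n.

```python
from itertools import product

def waerden_cnf(j, k, n):
    """Clauses over variables 1..n; variable i is true iff position i has color '1'."""
    assert j >= 2 and k >= 2
    clauses = []
    d = 1
    while 1 + (j - 1) * d <= n:        # forbid color-'1' progressions of length j
        for i in range(1, n - (j - 1) * d + 1):
            clauses.append(tuple(-(i + t * d) for t in range(j)))
        d += 1
    d = 1
    while 1 + (k - 1) * d <= n:        # forbid other-color progressions of length k
        for i in range(1, n - (k - 1) * d + 1):
            clauses.append(tuple(i + t * d for t in range(k)))
        d += 1
    return clauses

def satisfiable(cnf, n):
    """Exhaustive search over all assignments; only sensible for very small n."""
    return any(
        all(any(bits[abs(l) - 1] == (l > 0) for l in clause) for clause in cnf)
        for bits in product([False, True], repeat=n)
    )

sat_8 = satisfiable(waerden_cnf(3, 3, 8), 8)
sat_9 = satisfiable(waerden_cnf(3, 3, 9), 9)
```

The formula is satisfiable for n = 8 and unsatisfiable for n = 9, in line with W(3, 3) = 9.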

*Example 2:* Consider the case of j = k = 3, n = 10. When a variable i is assigned true, it represents the decision to assign slot i the color '1', and '0' otherwise. Then no 3 slots...


The same constraints, but with negated literals, are now added for the color '1'. For example, for gap 1, add (-1 -2 -3) ... (-8 -9 -10), etc.

The clauses as defined in (9) have what we call a *gliding symmetry*<sup>2</sup>. This means that the same clause is replicated in the formula while shifting the variable index by a constant up to some bound; for example, (1 2 3) is in φ, but so are (2 3 4) ... (8 9 10). Similarly (-1 -2 -3) is replicated with a negative constant. For a clause c, let c<sup>i</sup><sub>z</sub> denote the clause attained by taking i steps towards zero, and similarly let c<sup>i</sup><sub>n</sub> denote the clause attained by taking i steps away from zero, i.e., towards n or −n. For example (3 4 5)<sup>1</sup><sub>z</sub> = (2 3 4) and (1 2 3)<sup>1</sup><sub>n</sub> = (2 3 4). As another example, this time focusing on the negative constraints, (-1 -3 -5)<sup>1</sup><sub>n</sub> = (-3 -5 -7)<sup>1</sup><sub>z</sub> = (-2 -4 -6).

For each clause c ∈ φ, we save the *gliding bounds* [i, j], where i, j are the maximal integers such that c<sup>i</sup><sub>z</sub>, c<sup>j</sup><sub>n</sub> ∈ φ. For example, for the clause c = (2 3 4) of Example 2, we save the pair [1, 6], because we can 'glide' by up to one step towards zero and by up to six steps towards n = 10 (giving us, respectively, (1 2 3) and (8 9 10)). As another example, the pair for the clause (-4 -5 -6) is [3,4], because we can glide by up to three steps towards zero, and by up to four steps towards −n = −10. Denote by c.z and c.n the two bounds of a clause c, corresponding to i, j above, respectively.
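The gliding bounds of the original clauses can be computed by simply trying successive shifts. The following sketch is our own illustration (the names `glide` and `gliding_bounds` are hypothetical); it assumes, as is the case for Van der Waerden formulas, that if a glide by s steps stays in φ then so do all shorter glides, so the search can stop at the first failure.

```python
def glide(clause, steps):
    """Shift every literal `steps` away from zero (negative steps: towards zero)."""
    return tuple(l + steps if l > 0 else l - steps for l in clause)

def gliding_bounds(clause, phi):
    """Return [z, n]: the maximal glides towards zero and away from zero within phi."""
    smallest = min(abs(l) for l in clause)
    z = 0
    # Stop before any literal would reach zero.
    while smallest - (z + 1) >= 1 and glide(clause, -(z + 1)) in phi:
        z += 1
    n = 0
    while glide(clause, n + 1) in phi:
        n += 1
    return [z, n]

# The formula of Example 2: j = k = 3, n = 10.
N = 10
progressions = [(i, i + d, i + 2 * d) for d in range(1, N) for i in range(1, N - 2 * d + 1)]
phi = set(progressions) | {tuple(-x for x in p) for p in progressions}

bounds_a = gliding_bounds((2, 3, 4), phi)
bounds_b = gliding_bounds((-4, -5, -6), phi)
```

The two computed pairs agree with the bounds [1, 6] and [3, 4] given in the text.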

So far we only considered the original clauses of the problem. We now consider the question of what are the bounds for the learned clauses. Let c<sub>1</sub>, . . . , c<sub>m</sub> be the antecedent clauses of a new learned clause c. We compute the gliding bounds of c as follows:

$$c.z = \min(c_1.z, \dots, c_m.z) \qquad c.n = \min(c_1.n, \dots, c_m.n) \,. \tag{10}$$

<sup>2</sup>Mathematicians use this term for describing a pattern that repeats itself by an operation of shifting in one dimension in space, e.g., ♠ ♠ ♠ ♠ . . .

The rationale behind (10) is that we can only glide c towards zero (or away from zero) as much as we can glide all of its antecedents towards zero (or away from zero).

Given the gliding bounds of each clause, it is easy to use Proposition 1 for learning new e-clauses. Using the terminology of that proposition, the antecedents of c form φ<sub>2</sub>, and σ is a mapping that applies 'gliding' to them. Each amount of gliding is a separate mapping σ. The gliding bounds tell us the amount by which gliding each clause results in a clause that is still in φ – those new clauses are φ<sub>3</sub> in the proposition. In other words, those bounds define the mappings that we can use for deriving new e-clauses.

*Example 3:* Suppose φ includes the following clauses and respective bounds:

$$(3\ 6\ 10)[2, 0] \qquad (-7\ -5\ -3)[2, 2] \qquad (-7\ -6\ -5)[4, 2] \tag{11}$$

from which the solver inferred via resolution the clause c = (-7 -5 10). With (10) we compute the gliding bounds [2,0] for c. This means that we have two mappings:


i.e., a glide by one and two towards 0. So we add the e-clauses σ<sub>1</sub>(c) = (-6 -4 9) and σ<sub>2</sub>(c) = (-5 -3 8). Indeed, if we apply σ<sub>1</sub> to the clauses in (11), we get three clauses in φ, from which we can infer σ<sub>1</sub>(c):

$$\begin{array}{ll} \sigma_1(3\ 6\ 10) = (2\ 5\ 9) & \sigma_1(-7\ -5\ -3) = (-6\ -4\ -2)\\ \sigma_1(-7\ -6\ -5) = (-6\ -5\ -4) \end{array}$$

Finally, we should compute the gliding bounds of the e-clauses themselves, because they may participate in further learning. For this, we shift the bounds of the conflict clause by the same amount as dictated by the mapping σ, while recalling that any step towards zero is a step away from n (or −n if it is a negative literal), and vice versa.

*Example 4:* Reconsider c of Example 3. Its bounds are [2,0]. We computed σ1(c) by gliding c towards zero by 1. Hence the bounds of σ1(c) are [2−1,0+1] = [1,1].
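Examples 3 and 4 can be put together in a short sketch: compute the bounds of a learned clause with (10), emit one e-clause per allowed glide, and attach the shifted bounds to each. The function names below are hypothetical and the code is an illustration of the bookkeeping, not the solver's implementation.

```python
def glide(clause, steps):
    """Shift literals `steps` away from zero; negative steps move towards zero."""
    return tuple(l + steps if l > 0 else l - steps for l in clause)

def learned_bounds(antecedent_bounds):
    """Equation (10): component-wise minimum over the antecedents' bounds."""
    return [min(b[0] for b in antecedent_bounds),
            min(b[1] for b in antecedent_bounds)]

def gliding_e_clauses(clause, bounds):
    """All e-clauses obtainable by gliding, each paired with its own bounds."""
    z, n = bounds
    result = []
    for s in range(1, z + 1):       # s steps towards zero
        result.append((glide(clause, -s), [z - s, n + s]))
    for s in range(1, n + 1):       # s steps away from zero
        result.append((glide(clause, s), [z + s, n - s]))
    return result

bounds = learned_bounds([[2, 0], [2, 2], [4, 2]])   # bounds listed in (11)
e_clauses = gliding_e_clauses((-7, -5, 10), bounds)
```

For the clause of Example 3 this yields (-6 -4 9) with bounds [1, 1] and (-5 -3 8) with bounds [0, 2].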

#### *B. Boolean Pythagorean triples*

We conclude with an example that shows that e-clauses are not necessarily tied to gliding symmetry.

Three positive integers a, b, c are called a Pythagorean triple if they satisfy a<sup>2</sup> + b<sup>2</sup> = c<sup>2</sup>. The challenge is:

*Definition 4:* For a given n ∈ N, can 1..n be separated into two sets, such that no set contains a Pythagorean triple?

As an example, for n = 17 if we choose the subset of integers that is here marked with an underline: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17, it proves that for n = 17 the answer is yes.

The general question of whether there exists an n for which the answer is negative was open for many years. The celebrated result of Heule et al. [16] a few years ago proved, with the help of a SAT solver, that the answer is positive.

The encoding of the problem in Def. 4 with CNF is very simple: defne n variables, where the Boolean values in the satisfying assignment separate the values naturally to the two requested sets. For example, the encoding for n = 17 is

$$\begin{array}{cccc} (3\ 4\ 5) & (-3\ -4\ -5) & (5\ 12\ 13) & (-5\ -12\ -13)\\ (6\ 8\ 10) & (-6\ -8\ -10) & (8\ 15\ 17) & (-8\ -15\ -17)\\ (9\ 12\ 15) & (-9\ -12\ -15) & & \end{array}$$

Denote by φ<sub>n</sub> this formula for a given n. In the discussion that follows we will overload the multiplication and division signs, '·' and '/', to operate on clauses and sets of clauses: the operation is simply applied to each of the literals. For example, 2 · (3 4 5) = (6 8 10) and (6 8 10)/2 = (3 4 5).
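The encoding and the overloaded operators are easy to reproduce. The sketch below is an illustration with names of our own choosing (`pythagorean_cnf`, `scale`, `divide`); it enumerates triples naively, which is adequate only for small n.

```python
from math import isqrt

def pythagorean_cnf(n):
    """phi_n: for each triple a < b < c <= n with a^2 + b^2 = c^2, one positive
    and one negative clause forbid placing a, b, c in the same set."""
    clauses = []
    for a in range(1, n + 1):
        for b in range(a + 1, n + 1):
            c = isqrt(a * a + b * b)
            if c <= n and c * c == a * a + b * b:
                clauses.append((a, b, c))
                clauses.append((-a, -b, -c))
    return clauses

def scale(clause, i):
    """The overloaded multiplication: multiply every literal by i."""
    return tuple(l * i for l in clause)

def divide(clause, d):
    """The overloaded division: divide every literal by d (d must divide all)."""
    assert all(l % d == 0 for l in clause)
    return tuple(l // d for l in clause)

phi_17 = pythagorean_cnf(17)
```

For n = 17 the generated clause set coincides with the ten clauses listed above.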

We begin with two simple observations:

*Observation 1:* Pythagorean triples are closed under multiplication:

$$\forall a, b, c, i \in \mathbb{N}.\ a^2 + b^2 = c^2 \Rightarrow (a \cdot i)^2 + (b \cdot i)^2 = (c \cdot i)^2 \,.$$

*Observation 2:* Let |<sub>d</sub> denote 'divisible by d'. When applied to a set of numbers, it means that all the set's members are divisible by d. Then for all n,

$$(a\ b\ c) \in \varphi_n \land (a\ b\ c)|_d \Rightarrow \frac{(a\ b\ c)}{d} \in \varphi_n \,. \tag{12}$$

The second observation is simply the other side of the first one (dividing rather than multiplying), but it also states that the divided clause must be in φ<sub>n</sub>. For example, if n = 80 then (30 72 78) ∈ φ<sub>80</sub>, which implies that also (30 72 78)/2 = (15 36 39) ∈ φ<sub>80</sub>.

For each clause c, we defne recursively

$$c.gcd = \begin{cases} \gcd(\{l \mid l \in c\}) & c \text{ is original} \\ \gcd(\{c_i.gcd \mid c_i \in S\}) & c \text{ is inferred from a clause set } S \end{cases} \tag{13}$$

where gcd() is the greatest common divisor function. Observe that if c is original, then c.gcd is the greatest common divisor of its own variables, and otherwise of the variables in the core of original clauses that derived it, which we will denote by core(c). This recursive definition gives us an immediate method to implement it in a SAT solver: the base case corresponds to the original clauses, and the step to the learning that is done during conflict analysis.

Given a confict clause c, we can see that for i ∈ [1, bound(n)] (bound(n) will be defned shortly), we have

$$i \cdot \frac{core(c)}{c.gcd} \subseteq \varphi\_n \,. \tag{14}$$

This is a direct result of the two observations above: from Observation 2 we know that core(c)/c.gcd ⊆ φ<sub>n</sub>, and from Observation 1 we know that any multiple of such a clause is a Pythagorean triple. Whether it is part of φ<sub>n</sub> depends on the value of i, which brings us to the problem of computing bound(n). To compute it, we need to know the largest variable that participates in core(c). For each clause c, we define recursively

$$c.maxvar = \begin{cases} \max(\{l \mid l \in c\}) & \text{c is original} \\ \max(\{c\_i.maxvar \mid c\_i \in S\}) & \text{c is inferred} \\ & \text{from a set S} \end{cases}$$

Hence, for each clause c, c.maxvar denotes the largest variable that appears in core(c). In (14) we considered clauses i · core(c)/c.gcd. For these clauses to be part of φ<sub>n</sub>, the following relation should hold:

$$i \cdot \frac{c.maxvar}{c.gcd} \le n \; .$$

Isolating i gives us the bound: bound(n) = (n · c.gcd)/c.maxvar. Finally, observe the implication of (14): since i · core(c)/c.gcd ⊆ φ<sub>n</sub>, then

$$\varphi\_n \models i \cdot \frac{c}{c.gcd}, \text{ for } i \in \left[1, bound(n)\right].\tag{15}$$

This means that the clauses i · c/c.gcd can be added safely as e-clauses to φ<sub>n</sub>, without removing solutions. In other words, using the terminology of Sec. II, each i ∈ [1, bound(n)] defines a separate mapping for a conflict clause c:

$$
\sigma\_i(c) = i \cdot \frac{c}{c.gcd} \,. \tag{16}
$$
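The bookkeeping of (13), the maxvar definition, the bound, and the mapping (16) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Clause` class, the `parents` argument, and the example clauses are hypothetical, and bound(n) is rounded down because i ranges over integers.

```python
from functools import reduce
from math import gcd

class Clause:
    """A clause carrying the gcd and maxvar bookkeeping (illustrative only)."""
    def __init__(self, lits, parents=()):
        self.lits = tuple(lits)
        if parents:   # inferred clause: combine the values of its premises
            self.gcd = reduce(gcd, (p.gcd for p in parents))
            self.maxvar = max(p.maxvar for p in parents)
        else:         # original clause: use its own literals
            self.gcd = reduce(gcd, (abs(l) for l in self.lits))
            self.maxvar = max(abs(l) for l in self.lits)

def bound(n, c):
    # Largest integer i with i * c.maxvar / c.gcd <= n.
    return (n * c.gcd) // c.maxvar

def e_clauses(n, c):
    """The clauses sigma_i(c) = i * c / c.gcd for i in [1, bound(n)]."""
    base = tuple(l // c.gcd for l in c.lits)
    return [tuple(i * l for l in base) for i in range(1, bound(n, c) + 1)]

o1 = Clause((6, 8, 10))
o2 = Clause((-10, -24, -26))
learned = Clause((6, 8, -24, -26), parents=(o1, o2))   # resolvent on variable 10
print(learned.gcd, learned.maxvar, bound(40, learned))  # 2 26 3
print(e_clauses(40, learned))
# [(3, 4, -12, -13), (6, 8, -24, -26), (9, 12, -36, -39)]
```

Note that the instance i = c.gcd reproduces the learned clause itself, so a solver would presumably skip it when adding e-clauses.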

#### IV. IMPLEMENTATION DETAILS

Recall that according to (7) the formula may contain a nonempty set of clauses φ1 that cannot participate in generating e-clauses. In our implementation we mark those clauses at the beginning (such clauses are expected to be given in a separate input file), and then also mark each learned clause that has at least one antecedent marked that way. For simplicity let us call these clauses *non-symmetric* and the rest *symmetric*.

To keep track of these dependencies, we altered the solver. This is a nontrivial task because logical dependency between clauses is created in many different parts of a modern solver. Our implementation is based on MAPLE_LCM_DIST_CHRONOBT [20] (we abbreviate its name to CHRONO from here on), the winner of the SAT competition in 2018, which is itself built on top of multiple generations of optimizations that were added over the years, going back to MINISAT-2.2 [12]. In particular, dependency is created during conflict analysis in the process of learning a new clause, but also during clause minimization, binary-resolution minimization, learnt-clause simplifications, variable elimination, and propagation at decision level 0.<sup>3</sup> We maintain a single bit in the header of each clause that determines whether it is symmetric or not. Since CHRONO, like all MINISAT-based solvers, does not maintain unit clauses, we maintain a separate list of variables whose value is determined at level 0 based on non-symmetric clauses.

Next, we need to maintain problem-specific information that is necessary for deriving e-clauses. For example, for Van der Waerden formulas (see Sec. III-A) we need to keep for each clause its gliding bounds. For the Boolean Pythagorean triples problem (see Sec. III-B) we maintain the greatest common divisor (gcd) of the literals in the clause and all clauses that participated in deriving it, and the max variable in those clauses. As in the case of the symmetry bit described above, here too we need to update this information in every location in which dependency is created.

<sup>3</sup>These are implemented in the following functions in CHRONO: analyze, LitRedundant, binResMinimize, simplifyLearnt, eliminateVar, propagate.

Our implementation accumulates e-clauses and then adds them to the clause database at the nearest restart. This strategy differs from the ones mentioned in the introduction in the context of symmetric explanation learning [10] and dynamic symmetry handling [10], [25], where such clauses are added during BCP and hence affect the current search branch (we implemented both, and the results are rather similar, with a small advantage to the technique described here). To reduce side effects, upon adding a new e-clause we do *not* increase the counter of conflict clauses, since that counter affects various other heuristics, such as the frequency of applying simplifications and clause deletion.

The above-mentioned prior works describe various filtering methods: adding clauses only if they conflict with the current state or lead to further propagation, or, in the case of [25], if the conflict clause itself has a low LBD. Several filtering and deletion strategies that we experimented with are described in the long version of this article [1]. Briefly, the ones we settled on as best in our experiments are (1) add an e-clause only if up to 3 literals are not false under the current partial assignment, and (2) do not add e-clauses larger than 20. As for deletion strategies, we (1) gave a separate initial activity score of 0.8 to e-clauses and (2) set the deletion ratio to 0.8, i.e., a more aggressive deletion compared to the default of 0.5. We kept this deletion ratio also for the experiments without e-clauses, for a fair comparison.
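The two filtering rules can be sketched as a single predicate. This is a minimal sketch, not the solver code: the function name `keep_e_clause`, the representation of literals as signed integers, and the partial assignment as a dictionary from variables to Booleans are all assumptions made for illustration.

```python
def is_false(lit, assignment):
    """A literal is false iff its variable is assigned the opposite polarity."""
    value = assignment.get(abs(lit))        # None means 'unassigned'
    return value is not None and value != (lit > 0)

def keep_e_clause(clause, assignment, max_nonfalse=3, max_size=20):
    """Apply both filters: size at most 20, and at most 3 literals that are not false."""
    if len(clause) > max_size:
        return False
    non_false = sum(1 for lit in clause if not is_false(lit, assignment))
    return non_false <= max_nonfalse

# Variables 1 and 2 are assigned so that literals 1 and -2 are false; 3 and 4 are unassigned.
print(keep_e_clause((1, -2, 3, 4), {1: False, 2: True}))   # True: only 2 non-false literals
print(keep_e_clause((1, 2, 3, 4), {}))                     # False: 4 non-false literals
```

The thresholds are exposed as parameters because the text reports them as the values that worked best in the experiments, not as fixed constants of the method.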

#### V. RESULTS

We implemented this method for Van der Waerden numbers and Boolean Pythagorean triples. Since there are no standard benchmark sets for these problems, we generated instances and took all of those that can be solved with at least one configuration in less than 30 min. and with at least one configuration in more than 1 min. For the Van der Waerden problems, this resulted in 30 benchmarks (16 unsat, 14 sat). The benchmarks, full tables of results, and the implementation are available from [3]. We used the HBENCH benchmarking system [2] to conduct the experiments and data collection.

In the results tables below, timed-out benchmarks contribute the values they had at the timeout point to the various columns, other than the par-2 column, where the timeout is added twice, to be consistent with the ranking method of the SAT competitions. Our goal was mostly to measure the number of e-clauses that can be found based on isomorphic subgraphs, beyond what can be found with dynamic symmetry exploitation. We have evidence from multiple previous works, e.g., [10], [25], [24] (see Sec. I), that such clauses can help in reducing the run time. Our results below show not only that many more such clauses can indeed be generated, but also that, when combined with the right filtering and deletion methods, they reduce the run time on average.
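As a small illustration of the par-2 convention, the sketch below charges every timed-out run twice the timeout and averages over all runs. The function name `par2`, the use of `None` for a timed-out run, and the choice of averaging (rather than summing) are assumptions for this example.

```python
def par2(times, timeout):
    """Average run time where each timed-out run (None or >= timeout) counts as 2 * timeout."""
    penalized = [t if t is not None and t < timeout else 2 * timeout for t in times]
    return sum(penalized) / len(penalized)

# Three runs with a 100-second timeout; the second one timed out.
print(par2([10, None, 50], timeout=100))   # (10 + 200 + 50) / 3
```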

The results for the Van der Waerden problems are summarized in Table I, sorted by performance. The '-waerden' flag indicates that e-clauses are added as described in Sec. III-A. The '-dyn-sym-exploit' flag indicates that e-clauses based on dynamic symmetry exploitation were added. 'native' means that the solver was run in its default configuration other than the deletion ratio (see Sec. IV). 'static-sym-breaking' indicates that we solved the formula with static symmetry-breaking constraints, as provided by BREAKID, while the solver is in the same configuration as 'native'. For these benchmarks static symmetry breaking turns out to be better than dynamic symmetry exploitation, based on the same data (even when considering the unsat cases on their own).

On average, each conflict clause learned while solving these benchmarks results in over 20 e-clauses with the -waerden flag (this clearly depends on the value of n), and fewer than 1 with -dyn-sym-exploit. The latter is expected, since BREAKID generates a single generator for these benchmarks (see text after Def. 3). The top part of the table does not reflect these numbers, however, because it refers to runs in which we applied aggressive filtering as mentioned before. With these filters, the number of e-clauses added is typically less than 5% of the total number of clauses. Hence the potential for e-clauses is large, and perhaps future research into filtering techniques will be able to exploit this unused potential. The overhead of generating the e-clauses is marginal (the 'Overhead' column). The overhead of running BREAKID, a necessary step for applying both -dynamic-symmetry and symmetry-breaking, was a few seconds and is not included in the 'Time' column.

We can see a run-time reduction of 42% compared to a native run for the case of Van der Waerden formulas, and of 55% for the case of Pythagorean triples. In both cases the technique as described in Sec. III-A is better than adding e-clauses based on data derived from static symmetry, and better than combining these two sources of data. Cactus plots for both families appear in Figs. 2 and 3.

We also checked how active the e-clauses are in deriving new clauses. For this measure we define as e-clauses, recursively, the set of clauses that we add directly and the clauses that were learned based on at least one e-clause premise. Activity of clauses is updated in the solver in the usual way, based on their participation in deriving other clauses. Since clause deletion is based on this activity, the ratio between the average number of 'live' clauses (i.e., those that were not deleted) and the total number of learned clauses is an indication of how active they are. This ratio for e-clauses and normal conflict clauses appears in the last two columns of the table. It is surprising to see that the e-clauses are more active, especially since we initialize the activity score of e-clauses with a lower value than the one given to conflict clauses.

For the Boolean Pythagorean triples problem, we generated 21 satisfiable instances (the first unsatisfiable instance takes weeks to solve; see [16]) with the same selection criteria as described above. The results appear in Table II, also in ascending performance order. Here the native solver turns out to be improved upon by each of the configurations, including static symmetry breaking.

TABLE I

AVERAGE RESULTS FOR THE VAN DER WAERDEN PROBLEM, OVER 30 BENCHMARKS. TIME IS IN SECONDS. THE LAST TWO ROWS REFER TO RUNS WITHOUT ANY FILTERING OF THE E-CLAUSES.

Fig. 2. Results for the Van der Waerden benchmarks.

Fig. 3. Results for the Pythagorean-triples benchmarks.

#### VI. CONCLUSIONS AND FUTURE WORK

We presented a general condition for adding what we call e-clauses, right after conflict analysis. We showed how this technique generalizes 'symmetry' and 'almost symmetry', and that indeed this method can add far more clauses than dynamic symmetry exploitation and related methods that are solely based on such symmetries. We showed several known problems for which this is relevant, and mentioned cases in which it was already done in the past with empirical success.

There are three lines of future work that we consider important. First, it is important to classify additional problems as being amenable to adding e-clauses, and to check whether this can assist in accelerating their solving. Second, we foresee a dedicated SAT solver that maintains and reasons about *clause generators*. That is, instead of adding many e-clauses as normal clauses, just keep the base learned clause with its bounds. It can be faster than the alternative of adding all e-clauses and does not suffer from the necessity to delete most of them. In a sense, this way the e-clauses are generated lazily, on demand, and then immediately erased. There are many implementation details that need to be developed for this. For example, one can add the generator to the watch list of all the literals that would have watched one of its generated e-clauses. In BCP, that literal tells us how to apply the unit implication rule to the generator. The reason clause can be maintained as a pair of a reference to the generator and an instantiation index. Many other details still need to be worked out.

A third direction is to control the BCP order, such that it works first on 'normal' clauses and, only if it terminates without a conflict, continues to propagate through the e-clauses, based on the assumption that the latter are less likely to cause a conflict at the current branch. One can also envision a SAT solver that splits BCP on normal and e-clauses between two threads. A possible high-level architecture is one in which the main thread, T, works on 'normal' clauses and then on e-clauses, and the other, Te, works in the opposite order. The first that finds a conflict terminates the other, or, alternatively, the solver chooses the better conflict clause based on its LBD and backtracking level.


TABLE II

RESULTS FOR THE BOOLEAN PYTHAGOREAN TRIPLES PROBLEM, OVER 21 BENCHMARKS. THE BOTTOM TWO CONFIGURATIONS ARE WITHOUT FILTERING.

#### REFERENCES


*International Conference, CP 2002, Ithaca, NY, USA, September 9-13, 2002, Proceedings*, volume 2470 of *Lecture Notes in Computer Science*, pages 415–430. Springer, 2002.


# On Decomposition of Maximal Satisfiable Subsets

Jaroslav Bendík *Max Planck Institute for Software Systems* Kaiserslautern, Germany xbendik@mpi-sws.org

*Abstract*—In many areas of computer science, we are given an unsatisfiable formula F in CNF, i.e., a set of clauses, with the goal to analyze the unsatisfiability. A kind of such analysis is to identify Minimal Correction Subsets (MCSes) of F, i.e., minimal subsets of clauses that need to be removed from F to make it satisfiable. Equivalently, one might identify the complements of MCSes, i.e., Maximal Satisfiable Subsets (MSSes) of F. The more MSSes (MCSes) of F are identified, the better insight into the unsatisfiability can be obtained. Hence, there were proposed many algorithms for complete MSS (MCS) enumeration. Unfortunately, the number of MSSes can be exponential w.r.t. |F|, which often makes the complete enumeration practically intractable.

In this work, we attempt to cope with the intractability of complete MSS enumeration by initiating the study of *MSS decomposition*. In particular, we propose several techniques that often allow for decomposing the input formula F into several subformulas. Subsequently, we explicitly enumerate all MSSes of the subformulas, and then combine those MSSes to form MSSes of the original formula F. An extensive empirical study demonstrates that due to the MSS decomposition, the number of MSSes that need to be explicitly identified is often exponentially smaller than the total number of MSSes. Consequently, we are able to improve upon the scalability of contemporary MSS enumeration approaches by many orders of magnitude.

### I. INTRODUCTION

Boolean formulas in the Conjunctive Normal Form (CNF), wherein we are given a set F = {c<sub>1</sub>, ..., c<sub>n</sub>} of Boolean clauses, have been widely adopted as a suitable representation language to model the behaviour of systems and properties. In case we are given an unsatisfiable CNF formula F, the goal is usually to analyze the unsatisfiability. To perform such an analysis, two concepts are often used: a *Minimal Unsatisfiable Subset* (MUS) of F, and a *Minimal Correction Subset* (MCS) of F. Intuitively, an MUS represents a minimal reason for the unsatisfiability, whereas an MCS is a minimal subset of clauses that need to be removed from F to make it satisfiable. A dual notion to an MCS is that of a Maximal Satisfiable Subset (MSS), i.e., a satisfiable subset M of F such that for every clause c ∈ F ∖ M the set M ∪ {c} is unsatisfiable. It holds that every MSS is a complement of an MCS of F and vice versa, i.e., MSSes and MCSes represent the same information.

MCSes (MSSes) find many practical applications in various areas of computer science. For instance, in the context of belief update and argumentation, MCSes are used during an update of the belief in the presence of an incoming contradictory belief [16], [21]. Similarly, in the field of diagnosis of constraint systems [5], [37], [49], MCSes represent the constraints that need to be relaxed for the system to be conflict-free. Another application of MSSes arises in the context of the maximum satisfiability problem (MaxSAT), since MSSes with the maximum cardinality correspond to the solutions of MaxSAT. Yet other applications of MCSes can be found, e.g., during model based diagnosis [7], ontology debugging, or axiom pinpointing [1].

Often, it is the case that finding just a single MCS is sufficient. However, in many applications, the task of enumerating several or even all MCSes (MSSes) is crucial for properly understanding the underlying sources of the unsatisfiability. For example, enumeration of minimal correction subsets is essential in software fault localization [30]. In the context of MaxSAT solving, a restricted MSS enumeration is effective in approximately solving the problem if finding the exact solution is intractable [41]. In the domain of diagnosis, there have been proposed many diagnosis metrics that are based on complete enumeration and counting of MSSes and MCSes (see, e.g., [26], [52]). Moreover, there are several computational problems, such as enumeration of minimal unsatisfiable subsets [37], prime implicants [28], and maximal and minimal models [39], that can be reduced to MSS enumeration.

In the past decades, there have been proposed many approaches for enumeration of MSSes (see e.g., [5], [9], [11], [22], [35], [39], [44], [51]). However, the complete MSS enumeration is still often practically intractable [11]. One of the reasons is that the identification of the individual MSSes naturally subsumes checking several subsets of F for satisfiability, and these checks are very expensive (NP-complete). Another issue is that there can be in general exponentially many MSSes of F w.r.t. the number |F| of clauses of F.

In spirit, the intractability of complete MSS enumeration is very similar to the intractability that was dealt with in the context of the Boolean model counting problem. That is, given a Boolean formula H, count all models (satisfying assignments) of H. The earliest approaches for model counting were based on a complete model enumeration; however, since the number of models can be exponential w.r.t. the number of variables of H, the complete model enumeration is often practically intractable. Fortunately, due to extensive research in the past decades (e.g., [6], [43], [50], [53]), the model counting problem is often practically feasible even for formulas with exponentially many models. A substantial ingredient of contemporary model counters is *decomposition*; in particular, the counters are often able to decompose the input formula H into several independent sub-formulas, then count models of the sub-formulas, and multiply the sub-counts to get the model count for the whole H. At this point, one might wonder *whether it is possible to perform some kind of a* decomposition *in the context of MSS enumeration?*

In this paper, we initiate the study of the problem of MSS decomposition, and provide an affirmative answer to the above question. In particular, we propose two decomposition techniques that are applicable to some kinds of formulas. The first technique attempts to *directly* decompose the input formula F into several independent components (i.e., disjoint subsets of clauses) based on literals in the individual clauses. Due to the decomposition, we can first identify all MSSes of the individual components (using any existing MSS enumerator), and then form the MSSes of F by just cheaply composing the MSSes of the components. Note that the total number of MSSes of the individual components can be exponentially smaller than the total number of MSSes of F that we obtain from the composition. The second technique is applicable when the input formula F is not *directly decomposable*. In such a case, we first attempt to identify a suitable *cut* K for F, i.e., a subset K of F such that the formula F ∖ K can be directly decomposed. In this case, we can divide the MSSes of F into two groups: 1) MSSes that are subsets of F ∖ K, and 2) the remaining MSSes of F. The former group can be decomposed and solved via the first decomposition technique, whereas the latter group can be identified via any existing MSS enumerator.

Based on the two decomposition techniques, we build a novel MSS enumeration algorithm and experimentally compare it with other contemporary MSS enumeration tools. Out of 1491 benchmarks, the best contemporary approach can solve only 415 benchmarks, whereas our approach solves 788 benchmarks. Moreover, whereas contemporary approaches scale only to instances with at most 10<sup>8</sup> MSSes, our approach can handle even benchmarks with 10<sup>22</sup> MSSes.

Outline. The rest of the paper is organized as follows. Section II introduces preliminaries and Section III discusses related work. The two decomposition techniques are introduced in Section IV, and our MSS enumeration algorithm is presented in Section V. Section VI provides results of our experimental evaluation. Finally, Section VII discusses practical limitations of our approach, and Section VIII concludes.

#### II. PRELIMINARIES

Standard definitions for propositional (Boolean) logic are assumed. A Boolean formula F is built over a set Vars(F) of Boolean variables. A *literal* l is either a variable x ∈ Vars(F) or its negation ¬x, and Lits(F) denotes the set of all literals used in F. A *clause* c = {l<sub>1</sub>, ..., l<sub>k</sub>} is a set of literals. A Boolean formula in conjunctive normal form F = {c<sub>1</sub>, ..., c<sub>n</sub>}, shortly *a CNF formula*, is a set of clauses. Given a CNF formula F, a *valuation* π of Vars(F) is a mapping π : Vars(F) → {1, 0}. The valuation π *satisfies* a clause c ∈ F iff there exists a variable x such that x ∈ c and π(x) = 1, or ¬x ∈ c and π(x) = 0. Moreover, π satisfies F if it satisfies every clause c ∈ F; such a valuation π is called a *model* of F. Finally, F is *satisfiable* if it has a model, and otherwise, F is *unsatisfiable*.

Fig. 1: Illustration of 𝒫(F) from Example 1. We denote individual subsets of F as bit-vectors, e.g., {c<sub>1</sub>, c<sub>3</sub>} is written as 1010. The subsets with a dashed border are the unsatisfiable subsets, and the others are satisfiable subsets. The MUSes and MSSes are filled with a background color.

Throughout the whole paper, we use F = {c<sub>1</sub>, ..., c<sub>n</sub>} to denote the input unsatisfiable CNF formula of interest. Moreover, we write just *formula* instead of *CNF formula*. Finally, given a set X, we write 𝒫(X) to denote the power-set of X, and |X| to denote the cardinality of X.

Definition 1 (MSS). *A set* N*,* N ⊆ F*, is a* maximal satisfiable subset (MSS) *of* F *iff* N *is satisfiable and for every* c ∈ F ∖ N *the set* N ∪ {c} *is unsatisfiable.*

Definition 2 (MCS). *A set* N*,* N ⊆ F*, is a* minimal correction subset (MCS) *of* F *iff* F ∖ N *is satisfiable and for every* c ∈ N *the set* F ∖ (N ∖ {c}) *is unsatisfiable. Equivalently,* N *is an MCS of* F *iff* F ∖ N *is an MSS of* F*.*

Definition 3 (MUS). *A set* N*,* N ⊆ F*, is a* minimal unsatisfiable subset (MUS) *of* F *iff* N *is unsatisfiable and for every* c ∈ N *the set* N ∖ {c} *is satisfiable.*

Note that the maximality (minimality) concept used here is a *set maximality (minimality)*, and not a *maximum (minimum) cardinality* as, e.g., in the MaxSAT problem. Consequently, there can be MSSes (MUSes) with different cardinalities, and in general, there can be up to O(2<sup>|F|</sup>) MSSes (MUSes) of F (intuitively, there are exponentially many pair-wise incomparable subsets of F (w.r.t. the subset inclusion) and all of them can be MSSes (MUSes)). Given a formula N, we write MSS<sub>N</sub>, MCS<sub>N</sub>, and MUS<sub>N</sub> to denote the set of all MSSes, MCSes, and MUSes of N, respectively. Moreover, given a subset K of N, we write MSS<sub>N</sub><sup>K</sup> to denote the set of all MSSes of N that contain at least a single clause from K, i.e., MSS<sub>N</sub><sup>K</sup> = {M ∈ MSS<sub>N</sub> | M ∩ K ≠ ∅}.

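Example 1. *We illustrate the concepts on a simple example, depicted in Figure 1. Assume that* F = {c<sub>1</sub> = {x<sub>1</sub>}, c<sub>2</sub> = {¬x<sub>1</sub>}, c<sub>3</sub> = {x<sub>2</sub>}, c<sub>4</sub> = {¬x<sub>1</sub>, ¬x<sub>2</sub>}}*. There are two MUSes:* MUS<sub>F</sub> = {{c<sub>1</sub>, c<sub>2</sub>}, {c<sub>1</sub>, c<sub>3</sub>, c<sub>4</sub>}}*, three MSSes:* MSS<sub>F</sub> = {{c<sub>1</sub>, c<sub>4</sub>}, {c<sub>1</sub>, c<sub>3</sub>}, {c<sub>2</sub>, c<sub>3</sub>, c<sub>4</sub>}}*, and three MCSes:* MCS<sub>F</sub> = {{c<sub>2</sub>, c<sub>3</sub>}, {c<sub>2</sub>, c<sub>4</sub>}, {c<sub>1</sub>}}*.*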
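The sets listed in Example 1 can be checked by brute force over the power-set of F. The Python sketch below does this directly from Definitions 1 to 3; it reads c<sub>4</sub> as {¬x<sub>1</sub>, ¬x<sub>2</sub>}, which is the reading consistent with the listed MUSes. Clauses are encoded as tuples of signed integers, and all function names are illustrative assumptions.

```python
from itertools import combinations, product

def satisfiable(clauses):
    """Brute-force SAT check; a literal is a signed integer (negative = negated variable)."""
    variables = sorted({abs(l) for c in clauses for l in c})
    for values in product([False, True], repeat=len(variables)):
        val = dict(zip(variables, values))
        if all(any(val[abs(l)] == (l > 0) for l in c) for c in clauses):
            return True
    return False

def subsets(names):
    for k in range(len(names) + 1):
        for s in combinations(names, k):
            yield frozenset(s)

def analyse(formula):
    """Return (MSSes, MCSes, MUSes) of a formula given as {clause name: clause}."""
    names = sorted(formula)
    sat = {s: satisfiable([formula[c] for c in s]) for s in subsets(names)}
    mss = {s for s in sat if sat[s] and all(not sat[s | {c}] for c in names if c not in s)}
    mus = {s for s in sat if not sat[s] and all(sat[s - {c}] for c in s)}
    mcs = {frozenset(names) - s for s in mss}   # MCSes are the complements of MSSes
    return mss, mcs, mus

F = {"c1": (1,), "c2": (-1,), "c3": (2,), "c4": (-1, -2)}
mss, mcs, mus = analyse(F)
print(sorted(sorted(s) for s in mss))   # [['c1', 'c3'], ['c1', 'c4'], ['c2', 'c3', 'c4']]
print(sorted(sorted(s) for s in mcs))   # [['c1'], ['c2', 'c3'], ['c2', 'c4']]
print(sorted(sorted(s) for s in mus))   # [['c1', 'c2'], ['c1', 'c3', 'c4']]
```

This enumeration is exponential in |F| and only serves to validate small examples; it is not one of the enumeration algorithms discussed in this paper.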

By the definition, MCSes are exactly the complements of MSSes, and hence finding MSSes is the same as finding MCSes. Both these concepts are used in the literature, since in some situations, it is more suitable to talk about *corrections*, and in other situations about *maximal satisfiability*. In the rest of the paper, we will stick just to the notion of MSSes and focus on the following problem:

Problem 1. *Given an unsatisfiable CNF formula* F*, identify the set* MSS<sub>F</sub> *of all MSSes of* F*.*

When searching for MSSes of a given formula N, it is often possible to *reduce the search-space* via the concepts of *autark variables* and *lean kernel*. A set A ⊆ Vars(N) is an *autark set* for N iff there exists a valuation of A such that every clause of N that uses a variable from A is satisfied by the valuation [42]. Note that a union of two autark sets is also an autark set, and hence there exists a unique maximum autark set of N [31], [32]. The *lean kernel* of N is the set of all clauses of N that do not contain any variable from the maximum autark set. Let L be the lean kernel of N. It is well-known that the set N ∖ L is a subset of every MSS of N (see, e.g., [14], [31], [32]). Furthermore, the following observation holds<sup>1</sup>:

Observation 1. *Let* N *be a formula and* L *its lean kernel. Then* MSS<sub>N</sub> = {(N ∖ L) ∪ M | M ∈ MSS<sub>L</sub>}*.*

*Proof.* Let A be the autark set that corresponds to L, and let π be a valuation of A that satisfies N ∖ L.

⊇: Given M ∈ MSS<sub>L</sub>, we show that (N ∖ L) ∪ M ∈ MSS<sub>N</sub>. First, note that (N ∖ L) ∪ M is satisfiable: since A ∩ Vars(M) = ∅, we can combine π with a model π′ of M to get a model of (N ∖ L) ∪ M. Second, by contradiction, assume that there is a clause c ∈ L ∖ M such that (N ∖ L) ∪ M ∪ {c} has a model φ (i.e., (N ∖ L) ∪ M ∉ MSS<sub>N</sub>). However, such φ is necessarily also a model of M ∪ {c}, which contradicts that M ∈ MSS<sub>L</sub>.

⊆: Given M′ ∈ MSS<sub>N</sub>, we show that M = M′ ∖ (N ∖ L) ∈ MSS<sub>L</sub>. Since M′ ⊇ M and M′ is satisfiable, M is also satisfiable. Now, by contradiction, assume that M ∉ MSS<sub>L</sub>, i.e., there exists c ∈ L ∖ M such that M ∪ {c} is satisfiable with a model φ. However, since Vars(M ∪ {c}) ∩ A = ∅, we can combine φ with π to get a model of M′ ∪ {c}, which contradicts that M′ ∈ MSS<sub>N</sub>.

In other words, instead of searching for MSSes of the whole N, we can just search for MSSes of the lean kernel of N. If the lean kernel is relatively small, then working just with the kernel can bring a significant runtime and memory improvement.<sup>2</sup> There have been proposed several efficient algorithms for finding maximum autarky sets and the corresponding lean kernels (see, e.g., [33], [40]).

#### III. RELATED WORK

The problem of MSS (MCS) enumeration was extensively studied in the past decades and many various techniques for the complete enumeration were proposed, e.g., [5], [11], [22], [35], [36], [39], [44], [46]–[48], [51]. Below, we just briefly describe the work-flow of contemporary approaches (for a more detailed overview, please refer to [8]).

Contemporary MSS enumeration approaches gradually explore the power-set of F; *explored subsets* are those whose satisfiability is already determined by the algorithm, and *unexplored* are the other ones. When finding each subsequent MSS M, an MSS enumeration algorithm needs to ensure two things: 1) that M is so far unexplored, and 2) that M is indeed an MSS. Both these tasks are usually carried out via several calls to a SAT solver, and these SAT solver queries are the most time-consuming part of the computation. Despite the fact that extracting just a single MSS is in FP<sup>NP</sup>[log] [29] (i.e., requiring log |F| calls to a SAT solver), contemporary MSS enumerators usually need to perform *just* around 1-5 SAT solver calls per MSS (see [11]). Yet, in cases where the number of MSSes is relatively large (or even exponential), the overall number of SAT solver calls is still too high, which makes the complete enumeration practically intractable.

Alternatively, one can identify all MCSes (MSSes) by exploiting the so-called *minimal hitting set duality* [17], [49] between MCSes and MUSes. The duality states that every M′ ∈ MCS<sub>F</sub> is a minimal hitting set of MUS<sub>F</sub>. Hence, one can first identify the set MUS<sub>F</sub> via an MUS enumeration approach (e.g., [3]–[5], [9], [10], [12], [18], [24], [25], [35], [37], [44], [46], [51]), and then compute the minimal hitting sets of MUS<sub>F</sub> to get all MCSes of F. However, due to potentially exponentially many MUSes w.r.t. |F|, the complete MUS enumeration is also often practically intractable.
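The duality can be illustrated on the formula of Example 1: computing the minimal hitting sets of its two MUSes yields exactly its three MCSes. The following Python sketch uses a naive enumeration by increasing size; the function name `minimal_hitting_sets` is an assumption, and practical tools use far more refined hitting-set algorithms.

```python
from itertools import combinations

def minimal_hitting_sets(collection):
    """All minimal sets that intersect every set in `collection` (naive enumeration)."""
    universe = sorted(set().union(*collection))
    found = []
    # Candidates are visited by increasing size, so any non-minimal hitting set
    # is a superset of a smaller hitting set that was already found.
    for k in range(len(universe) + 1):
        for cand in map(frozenset, combinations(universe, k)):
            hits_all = all(cand & s for s in collection)
            if hits_all and not any(h <= cand for h in found):
                found.append(cand)
    return set(found)

mus_F = [frozenset({"c1", "c2"}), frozenset({"c1", "c3", "c4"})]
print(sorted(sorted(h) for h in minimal_hitting_sets(mus_F)))
# [['c1'], ['c2', 'c3'], ['c2', 'c4']]
```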

Recently, we have initiated a study [14] on the problem of counting the number |MSS<sub>F</sub>| of MSSes of a given formula F. In particular, we proposed the first MSS counting technique that does not rely on a complete explicit MSS enumeration. Briefly, given a formula F, we defined two Boolean formulas W and R such that |MSS<sub>F</sub>| = M<sub>W</sub> − M<sub>R</sub>, where M<sub>W</sub> and M<sub>R</sub> are the numbers of models of the two formulas, respectively. Therefore, we were able to determine the MSS count via two calls to a model counting tool. Crucially, contemporary model counters often need to explicitly identify just a fraction of the models, i.e., the model-counter somehow *decomposes* the task of identifying/counting MSSes. However, this decomposition is performed on the level of the model counting, whereas in this work, we propose a decomposition scheme that works natively on the structure of MSSes.

Finally, let us note that there were proposed several single MSS extractors, e.g. [2], [20], [23], [41], that are often used as subroutines of contemporary MSS enumerators. Also, there have been proposed several caching techniques, e.g. [47], [48], that can be used to speed up MSS enumerators.

#### IV. DECOMPOSITION OF MSSES

In this section, we provide several observations and propose several techniques that can be used to decompose the MSS enumeration problem into multiple easier sub-problems. Subsequently, in Section V, we utilize these techniques to build an efficient MSS enumeration algorithm.

<sup>1</sup>We believe that this observation is also well-known in the community, however, we did not find any work that explicitly formulates and proves it.

<sup>2</sup>Note that we have seen many industrial benchmarks where the lean kernel is indeed relatively small. However, there are also many industrial benchmarks where the lean kernel is the whole formula; in such cases, the extraction of the lean kernel is not useful.

Definition 4 (Decomposition Graph). *Given a formula* N*, the* decomposition graph *of* N*, denoted* G(N)*, is an undirected graph with:*

- *the set of vertices* N *(one vertex per clause), and*
- *an edge between two distinct clauses* c<sub>i</sub>, c<sub>j</sub> ∈ N *iff there is a literal* l ∈ c<sub>i</sub> *with* ¬l ∈ c<sub>j</sub>*.*


Definition 5 (Decomposition). *Given a formula* N*, the* decomposition *of* N*, denoted* D(N)*, is the set of connected components of* G(N) *(i.e.,* c<sub>1</sub>, c<sub>2</sub> ∈ N *belong to the same component iff there exists a path between* c<sub>1</sub> *and* c<sub>2</sub> *in* G(N)*).*
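A possible way to compute the decomposition is sketched below in Python. The edge rule used here (two clauses are adjacent iff one contains a literal whose negation occurs in the other) is an assumption read off the proof of Proposition 1, which appeals to Definition 4 for exactly this property; the function name `decomposition` and the encoding of clauses as tuples of signed integers are likewise illustrative.

```python
def decomposition(formula):
    """Connected components of the decomposition graph of {clause name: clause}."""
    names = list(formula)
    # Assumed edge rule: c_i and c_j are adjacent iff some literal l is in c_i and -l is in c_j.
    adjacent = {c: set() for c in names}
    for i, ci in enumerate(names):
        for cj in names[i + 1:]:
            if any(-l in formula[cj] for l in formula[ci]):
                adjacent[ci].add(cj)
                adjacent[cj].add(ci)
    components, seen = [], set()
    for start in names:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                      # depth-first search from `start`
            c = stack.pop()
            if c not in component:
                component.add(c)
                stack.extend(adjacent[c] - component)
        seen |= component
        components.append(frozenset(component))
    return components

N = {"c1": (1,), "c2": (-1,), "c3": (2,), "c4": (-2,)}
print(sorted(sorted(c) for c in decomposition(N)))   # [['c1', 'c2'], ['c3', 'c4']]
```

Under this rule, the formula of Example 1 forms a single component, since its clause c<sub>4</sub> clashes with both c<sub>1</sub> and c<sub>3</sub>; such a formula is not directly decomposable.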

Our crucial observation here is that if |D(N)| > 1, then the problem of finding MSSes of N can be solved as follows. First, we identify the MSSes of the individual components in D(N). Second, we compose the MSSes of the individual components via a *compositional operator* ⊔ into MSSes of the whole N. The compositional operator and our compositional observation are formalized as follows.

Definition 6 (⊔). *Let* Ω = {ℳ<sub>1</sub>, ..., ℳ<sub>p</sub>} *be a collection of sets of formulas. By* ⊔(Ω)*, we denote the set of formulas* ⊔(Ω) = {M<sub>1</sub> ∪ ··· ∪ M<sub>p</sub> | M<sub>1</sub> ∈ ℳ<sub>1</sub> ∧ ··· ∧ M<sub>p</sub> ∈ ℳ<sub>p</sub>}*.*

Proposition 1. *Given a formula* N*, it holds that* MSS<sup>N</sup> " \ptMSS<sup>C</sup> | C P DpNquq*.*

*Proof.* Let DpNq " tC1, . . . , Cpu and assume a set M " M<sup>1</sup> Y ¨ ¨ ¨ Y M<sup>p</sup> such that M<sup>1</sup> P MSS<sup>C</sup><sup>1</sup> ^ ¨ ¨ ¨ ^ M<sup>p</sup> P MSS<sup>C</sup><sup>p</sup> .

Ě: Assuming M P \ptMSS<sup>C</sup> | C P DpNquq, we show M P MSS<sup>N</sup> . Let π1, . . . , π<sup>p</sup> be models of M1, . . . , Mp, respectively. W.l.o.g, assume that for every 1 ď k ď p and every literal l P LitspMkq such that l R LitspMkq, it holds that π<sup>k</sup> satisfies l. By Definition 4, there are no two distinct M<sup>i</sup> , M<sup>j</sup> with clauses c<sup>i</sup> P M<sup>i</sup> , c<sup>j</sup> P M<sup>j</sup> such that there exists a literal l P c<sup>i</sup> with l P c<sup>j</sup> . Consequently, for every two π<sup>i</sup> and π<sup>j</sup> it holds that they agree on common variables. Hence, we can compose π1, . . . , π<sup>p</sup> to form a model of M. To see that M is an MSS of N, assume by contradiction a clause c P NzM such that M Y tcu is satisfiable. However, this means that there exists 1 ď k ď p such that c P C<sup>k</sup> and M<sup>k</sup> Y tcu is satisfiable, which contradicts that M<sup>k</sup> is an MSS of Ck.

Ď: Assuming M P MSS<sup>N</sup> , we show M P \ptMSS<sup>C</sup> | C P DpNquq. Since M is satisfiable, then all individual M1, . . . , M<sup>p</sup> are also satisfiable. Now, by contradiction, assume an M<sup>i</sup> that is not an MSS of C<sup>i</sup> , i.e., there exists a clause c P CizM<sup>i</sup> such that M<sup>i</sup> Y tcu has a model πi . Furthermore, let π1, . . . , π<sup>i</sup>´<sup>1</sup>, π<sup>i</sup>`<sup>1</sup>, . . . π<sup>p</sup> be models of M1, . . . , M<sup>i</sup>´<sup>1</sup>, M<sup>i</sup>`<sup>1</sup>, . . . Mp. W.l.o.g, assume that for every 1 ď k ď p and every literal l P LitspCkq such that l R LitspCkq, it holds that π<sup>k</sup> satisfies l. Same as in Ě: above, we can compose π1, . . . , π<sup>p</sup> to form a model of MYtcu which contradicts that M is an MSS of N.

Example 2. *Let* $N = \{c_1 = \{x_1\}, c_2 = \{\neg x_1\}, c_3 = \{x_2\}, c_4 = \{\neg x_2\}, c_5 = \{\neg x_1, \neg x_2\}, c_6 = \{y_1\}, c_7 = \{\neg y_1\}, c_8 = \{y_2\}, c_9 = \{\neg y_1, \neg y_2\}\}$*. Here,* $D(N) = \{C_1, C_2\}$*, where* $C_1 = \{c_1, c_2, c_3, c_4, c_5\}$ *and* $C_2 = \{c_6, c_7, c_8, c_9\}$*.* $\mathsf{MSS}_{C_1} = \{\{c_2, c_3, c_5\}, \{c_2, c_4, c_5\}, \{c_1, c_4, c_5\}, \{c_1, c_3\}\}$ *and* $\mathsf{MSS}_{C_2} = \{\{c_7, c_8, c_9\}, \{c_6, c_8\}, \{c_6, c_9\}\}$*. Thus, the whole* $N$ *has 12 MSSes.*
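Example 2 is small enough to be checked by brute force. The sketch below is only an illustration: `is_sat`, `all_mss`, and `compose` are hypothetical helper names, the exhaustive search is exponential and suitable for tiny formulas only, and the variables $x_1, x_2, y_1, y_2$ are encoded as the integers 1 to 4 with the clause polarities that are consistent with the MSS sets listed in the example.

```python
from itertools import combinations, product

def is_sat(clauses):
    """Brute-force SAT check; only meant for tiny formulas."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for bits in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, bits))
        if all(any(model[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

def all_mss(formula):
    """All MSSes of `formula` (dict name -> clause) as frozensets of names."""
    found = []
    for size in range(len(formula), -1, -1):      # larger subsets first
        for names in combinations(sorted(formula), size):
            subset = frozenset(names)
            if any(subset < mss for mss in found):
                continue                          # extendable, so not maximal
            if is_sat([formula[name] for name in names]):
                found.append(subset)
    return found

def compose(parts):
    """Compositional operator: union one MSS from every component."""
    return [frozenset().union(*choice) for choice in product(*parts)]

# Example 2, with x1, x2, y1, y2 encoded as the integers 1, 2, 3, 4.
N = {"c1": {1}, "c2": {-1}, "c3": {2}, "c4": {-2}, "c5": {-1, -2},
     "c6": {3}, "c7": {-3}, "c8": {4}, "c9": {-3, -4}}
C1 = {name: N[name] for name in ("c1", "c2", "c3", "c4", "c5")}
C2 = {name: N[name] for name in ("c6", "c7", "c8", "c9")}
composed = compose([all_mss(C1), all_mss(C2)])
print(len(composed), set(composed) == set(all_mss(N)))
```

Only the 4 + 3 MSSes of the two components are searched for directly; the 12 MSSes of the whole formula are obtained by taking unions.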

As witnessed in Example 2, due to Proposition 1, we can substantially reduce the number of MSSes that need to be *explicitly* identified to obtain the whole set $\mathsf{MSS}_N$. Theoretically, it might even be the case that we need to explicitly identify just logarithmically many MSSes w.r.t. $|\mathsf{MSS}_N|$ (assume that $N$ contains $\log_2 |\mathsf{MSS}_N|$ components with 2 MSSes per component). However, from the practical point of view, how often is it the case that we can actually achieve such a reduction? And, moreover, what if $|D(N)| = 1$, i.e., when Proposition 1 cannot be applied? *Can we still do some decomposition when* $|D(N)| = 1$*?* We provide an affirmative answer to this question by finding *decomposition cuts* for $N$.
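As a quick check of the logarithmic bound, consider the idealized case of $p$ components $C_1, \dots, C_p$ with exactly two MSSes each (the symbol $p$ is introduced here only for illustration). By Proposition 1:

$$|\mathsf{MSS}_N| = \prod_{i=1}^{p} |\mathsf{MSS}_{C_i}| = 2^p, \qquad \sum_{i=1}^{p} |\mathsf{MSS}_{C_i}| = 2p = 2 \log_2 |\mathsf{MSS}_N|.$$

For instance, $p = 20$ components would yield $2^{20} = 1048576$ MSSes of $N$ from only 40 explicitly identified ones.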

Definition 7 (decomposition cut). *Given a formula* $N$ *such that* $|D(N)| = 1$*, a set* $K \subsetneq N$ *is a* decomposition cut *for* $N$ *iff* $|D(N \setminus K)| \ge 2$*.*

Note that decomposition cuts for a formula N correspond to *graph cuts* in the decomposition graph GpNq. Our crucial observation about decomposition cuts is stated in Proposition 2 and Corollary 1.

Proposition 2. *Let* $N$ *be a formula and* $K$ *its subset. Then* $\mathsf{MSS}_N = \mathsf{MSS}_N^K \cup \{M \in \mathsf{MSS}_{N \setminus K} \mid \forall M' \in \mathsf{MSS}_N^K.\ M \not\subset M'\}$*.*

*Proof.* Let $\mathsf{MSS}_N^{\overline{K}}$ denote the set of all MSSes of $N$ that do not contain any clause from $K$. Clearly, $\mathsf{MSS}_N = \mathsf{MSS}_N^K \cup \mathsf{MSS}_N^{\overline{K}}$. To prove Proposition 2, we show that $\mathsf{MSS}_N^{\overline{K}} = \{M \in \mathsf{MSS}_{N \setminus K} \mid \forall M' \in \mathsf{MSS}_N^K.\ M \not\subset M'\}$.

$\subseteq$: Assume $M \in \mathsf{MSS}_N^{\overline{K}}$, hence for all $c \in (N \setminus M)$ the set $M \cup \{c\}$ is unsatisfiable, and hence $M \in \mathsf{MSS}_{N \setminus K}$. Furthermore, since $M$ is an MSS of $N$, there cannot exist any $M' \in \mathsf{MSS}_N^K$ with $M \subsetneq M'$.

$\supseteq$: Given $M \in \mathsf{MSS}_{N \setminus K}$ such that $\forall M' \in \mathsf{MSS}_N^K.\ M \not\subset M'$, we show $M \in \mathsf{MSS}_N^{\overline{K}}$. By contradiction, assume that $M \notin \mathsf{MSS}_N^{\overline{K}}$, i.e., there exists $c \in N \setminus M$ such that $M \cup \{c\}$ is satisfiable. Since $M \in \mathsf{MSS}_{N \setminus K}$, it follows that $c \in K$; however, that means that there exists $M' \in \mathsf{MSS}_N^K$ such that $M' \supseteq M \cup \{c\}$.

Corollary 1. *Let* $N$ *be a formula and* $K \subsetneq N$ *a decomposition cut for* $N$*. Then* $\mathsf{MSS}_N = \mathsf{MSS}_N^K \cup \{M \in \sqcup(\{\mathsf{MSS}_C \mid C \in D(N \setminus K)\}) \mid \forall M' \in \mathsf{MSS}_N^K.\ M \not\subset M'\}$*.*

*Proof.* A direct consequence of Propositions 1 and 2.
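The filtering step of Corollary 1 can be sketched as follows. This is an illustration under stated assumptions rather than the tool's code: the function name `combine_with_cut` is hypothetical, MSSes are represented as frozensets of clause names, and the MSS sets of the small instance were worked out by hand for $N = \{c_1 = \{x_1\}, c_2 = \{\neg x_1\}, c_3 = \{x_2\}, c_4 = \{\neg x_2\}, c_5 = \{\neg x_1, \neg x_2\}\}$ with the cut $K = \{c_5\}$.

```python
from itertools import product

def combine_with_cut(mss_with_cut, component_mss):
    """Build MSS_N from MSS_N^K and the MSSes of the components of N minus K.

    mss_with_cut:  MSSes of N containing at least one clause of K.
    component_mss: one list of MSSes per component of D(N minus K).
    """
    candidates = (frozenset().union(*choice)
                  for choice in product(*component_mss))
    # Keep a candidate only if no MSS touching the cut contains it.
    kept = [m for m in candidates
            if not any(m <= bigger for bigger in mss_with_cut)]
    return list(mss_with_cut) + kept

# Hand-computed inputs for the instance described in the lead-in.
mss_with_cut = [frozenset({"c2", "c3", "c5"}), frozenset({"c2", "c4", "c5"}),
                frozenset({"c1", "c4", "c5"})]
component_mss = [[frozenset({"c1"}), frozenset({"c2"})],   # component {c1, c2}
                 [frozenset({"c3"}), frozenset({"c4"})]]   # component {c3, c4}
result = combine_with_cut(mss_with_cut, component_mss)
```

Of the four composed candidates, only $\{c_1, c_3\}$ survives the check; the other three can be extended by $c_5$ and are therefore covered by an MSS that contains a clause from the cut.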

Finally, let us note that graph structures similar to the decomposition graph have been already used in several MUS and MSS related studies (see e.g. the work on *model rotation* [54] or *MUS counting* [13], [15]).

#### V. DECOMPOSITION-BASED MSS ENUMERATION

In this section, we present a novel MSS enumeration algorithm that is based on the *MSS decomposition* observations introduced in the previous section. Moreover, we exploit the concept of the lean kernel which was introduced in Section II.

### *A. Main Procedure*

The main procedure of our algorithm is shown in Algorithm 1. The input is a formula $F$ and the output is the set $\mathsf{MSS}_F$ of all MSSes of $F$. The computation starts by calling a procedure getKernel($F$) that identifies the lean kernel $L$ of $F$. Based on Observation 1, we can now restrict ourselves to searching for MSSes of $L$ and then *enlarge* the MSSes of $L$ to MSSes of the whole $F$. To find MSSes of $L$, we first use a procedure getComponents($L$) that determines the decomposition $D(L)$ of $L$. Subsequently, we iteratively identify all MSSes of the individual components. In particular, each component $N \in D(L)$ is first checked for satisfiability via a SAT solver (denoted isSAT($N$)). If $N$ is satisfiable, then $N$ is the only MSS of $N$. Otherwise, we use the procedure processComponent($N$) to identify all MSSes of $N$. We store the sets of MSSes of individual components in an auxiliary set LMSSparts. After processing all the components, we exploit Proposition 1 and build the set $\mathsf{MSS}_L$ of MSSes of $L$ by composing the MSSes of the individual components (stored in LMSSparts). Finally, based on Observation 1, we form the set $\mathsf{MSS}_F$ of all MSSes of $F$ by adding the complement $F \setminus L$ of the lean kernel $L$ to the individual MSSes of $L$.

To implement the procedure getKernel($F$) that identifies the lean kernel of a given formula $F$, we employ an approach proposed in [40]. To implement the procedure getComponents($L$) that finds the decomposition $D(L)$ of $L$, we build the decomposition graph $G(L)$ and identify its connected components (any graph algorithm for finding connected components can be used). Finally, the procedure processComponent($N$) is more involved and is described in the following subsection.

### *B. Processing a Component*

The procedure processComponent($N$) (Algorithm 2) starts by computing the lean kernel $I$ of $N$. Then, we identify a decomposition cut $K$ for $I$ via a procedure findCut($I$). Subsequently, following Corollary 1, we identify all MSSes of $I$.

In particular, first, we employ an existing MSS enumeration algorithm, denoted getMSSes($I$, $K$), to identify the set $\mathsf{MSS}_I^K$ of all MSSes of $I$ that contain at least a single clause from $K$. Subsequently, we use the procedure getComponents($I \setminus K$) to obtain the decomposition $D(I \setminus K)$ of $I \setminus K$. Then, we iteratively identify all MSSes of individual components $P \in D(I \setminus K)$ and store the sets of the MSSes in an auxiliary set IKMSSparts. Once we process all the components, we can form the MSSes of $I \setminus K$ as $\sqcup(\mathsf{IKMSSparts})$ (Proposition 1). Consequently, following Corollary 1, we can obtain $\mathsf{MSS}_I$ by combining $\mathsf{MSS}_I^K$ and $\sqcup(\mathsf{IKMSSparts})$ (line 8). Finally, to obtain the MSSes of the input set $N$, we enlarge individual MSSes from $\mathsf{MSS}_I$ by the set $N \setminus I$ (Observation 1).

The procedure findCut($I$) is described in the following subsection. To conclude this subsection, we explain how to implement the procedure getMSSes($A$, $B$) that identifies all MSSes of a formula $A$ that contain at least a single clause from a set $B$. When $A = B$ (i.e., we look for all MSSes of $A$ (line 7)), we can implement getMSSes($A$, $B$) by an arbitrary existing MSS enumeration algorithm. In the other case, when $B \subsetneq A$, the situation is more complicated. We are not aware of any existing MSS enumeration tool that would directly allow the user to specify sets $A$ and $B$ and then identify the MSSes of $A$ that contain at least a single clause from $B$. However, there exist several MSS enumeration algorithms, e.g., [11], [39], that allow the user to specify a subset $B' \subsetneq A$ of *hard clauses* and then identify all MSSes of $A$ that contain *all* clauses in $B'$. We observe that we can reduce the former task to the latter:

Proposition 3. *Let* $A$ *and* $B$ *be formulas such that* $B \subsetneq A$*. Furthermore, let* $A' = A \cup \{c_B\}$ *where* $c_B = \bigcup_{b \in B} b$*. Then* $\mathsf{MSS}_A^B = \{M \setminus \{c_B\} \mid M \in \mathsf{MSS}_{A'}^{\{c_B\}}\}$*.*

*Proof.* $\subseteq$: If $M \setminus \{c_B\} \in \mathsf{MSS}_A^B$, then there exists a clause $c \in M \cap B$, and since $M \setminus \{c_B\}$ is satisfiable and $c \subseteq c_B$, then also $M$ is satisfiable. Now, by contradiction, assume that $M$ is not an MSS of $A'$, i.e., there exists $d \in A \setminus M$ such that $M \cup \{d\}$ is satisfiable, hence $(M \cup \{d\}) \setminus \{c_B\}$ is satisfiable (which contradicts that $M \setminus \{c_B\} \in \mathsf{MSS}_A^B$).

$\supseteq$: If $M \in \mathsf{MSS}_{A'}^{\{c_B\}}$, then there necessarily exists a clause $c \subseteq c_B$ such that $c \in B \cap M$. Furthermore, since $M$ is satisfiable, then $M \setminus \{c_B\}$ is also satisfiable. Now, by contradiction, assume that $M \setminus \{c_B\} \notin \mathsf{MSS}_A^B$, i.e., there exists a clause $d \in A \setminus (M \setminus \{c_B\})$ such that $(M \setminus \{c_B\}) \cup \{d\}$ has a model $\pi$. Since $c \subseteq c_B$, then $\pi$ also satisfies $M \cup \{d\}$, which contradicts that $M \in \mathsf{MSS}_{A'}^{\{c_B\}}$.

Informally, the task of finding MSSes of $A$ that contain at least a single clause from $B$ can be reduced to the task of finding MSSes of $A'$ that contain the hard clause $c_B$. Namely, in our implementation, we employ the contemporary MSS enumeration tool RIME [11] to carry out getMSSes($A$, $B$).
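The construction behind Proposition 3 is short enough to sketch. The snippet below is a hedged illustration, not RIME's interface: formulas are dictionaries from clause names to sets of integer literals, and the helper names `add_union_clause` and `strip_union_clause` as well as the label `"c_B"` are hypothetical. The enumeration of MSSes of $A'$ with $c_B$ as a hard clause is left to an external enumerator.

```python
def add_union_clause(A, B, hard_name="c_B"):
    """Return A' = A ∪ {c_B}, where c_B is the union of all clauses in B.

    A maps clause names to sets of integer literals; B is a collection of
    names from A. `hard_name` is an illustrative label for the new clause.
    """
    c_B = frozenset().union(*(A[name] for name in B))
    A_prime = dict(A)          # leave the input formula untouched
    A_prime[hard_name] = c_B
    return A_prime

def strip_union_clause(msses, hard_name="c_B"):
    """Map MSSes of A' that contain c_B back to MSSes of A."""
    return [frozenset(m) - {hard_name} for m in msses if hard_name in m]
```

A caller would pass `add_union_clause(A, B)` to an enumerator that supports hard clauses, mark `"c_B"` as hard, and apply `strip_union_clause` to the result.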

Finally, let us note that instead of using an external MSS enumerator to implement getMSSes($A$, $B$), we could possibly make a recursive call of processComponent(...) (with some minor modifications) to get the MSSes. That is, we could recursively decompose the input formula into smaller and smaller parts. The reason why we do not do that is explained later in Observation 2. Briefly, every *usable* cut requires the existence of two disjoint MUSes in the formula, and based on our empirical experience, industrial benchmarks usually do not contain many disjoint MUSes.

#### *C. Finding a Suitable Decomposition Cut*

Recall that finding a decomposition cut $K$ for $I$ with $|D(I)| = 1$ amounts to finding a *graph cut* in the decomposition graph $G(I)$. Hence, we could use any existing algorithm for finding *cuts in a graph* to find $K$. However, here we need to find a *suitable* decomposition cut. In the following, we first describe three properties of a suitable decomposition cut: *Minimality*, *Balance*, and *Necessity*. Subsequently, we describe how to find a decomposition cut with such properties.

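**Algorithm 1:** DecExact($F$)

$$\begin{array}{rl}
1 & L \leftarrow \mathsf{getKernel}(F) \\
2 & D(L) \leftarrow \mathsf{getComponents}(L) \\
3 & \mathsf{LMSSparts} \leftarrow \emptyset \\
4 & \textbf{for } N \in D(L) \textbf{ do} \\
5 & \quad \textbf{if } \mathsf{isSAT}(N) \textbf{ then} \\
6 & \quad\quad \mathsf{LMSSparts} \leftarrow \mathsf{LMSSparts} \cup \{\{N\}\} \\
7 & \quad \textbf{else } \mathsf{LMSSparts} \leftarrow \mathsf{LMSSparts} \cup \{\mathsf{processComponent}(N)\} \\
8 & \mathsf{MSS}_L \leftarrow \sqcup(\mathsf{LMSSparts}) \\
9 & \textbf{return } \{(F \setminus L) \cup M \mid M \in \mathsf{MSS}_L\}
\end{array}$$

**Algorithm 2:** processComponent($N$)

$$\begin{array}{rl}
1 & I \leftarrow \mathsf{getKernel}(N) \\
2 & K \leftarrow \mathsf{findCut}(I) \\
3 & \mathsf{MSS}_I^K \leftarrow \mathsf{getMSSes}(I, K) \\
4 & D(I \setminus K) \leftarrow \mathsf{getComponents}(I \setminus K) \\
5 & \mathsf{IKMSSparts} \leftarrow \emptyset \\
6 & \textbf{for } P \in D(I \setminus K) \textbf{ do} \\
7 & \quad \mathsf{IKMSSparts} \leftarrow \mathsf{IKMSSparts} \cup \{\mathsf{getMSSes}(P, P)\} \\
8 & \mathsf{MSS}_I \leftarrow \mathsf{MSS}_I^K \cup \{M \in \sqcup(\mathsf{IKMSSparts}) \mid \forall M' \in \mathsf{MSS}_I^K.\ M \not\subset M'\} \\
9 & \textbf{return } \{(N \setminus I) \cup M \mid M \in \mathsf{MSS}_I\}
\end{array}$$

For ease of presentation, assume that we identify a decomposition cut $K$ for $I$ such that $|D(I \setminus K)| = 2$, and let $C_1$ and $C_2$ denote the two components of $D(I \setminus K)$. Hence, in Algorithm 2, it holds that $\mathsf{IKMSSparts} = \{\mathsf{MSS}_{C_1}, \mathsf{MSS}_{C_2}\}$.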

*Minimality.* Recall that in Algorithm 2, line 8, we build the set $\mathsf{MSS}_I$ as $\mathsf{MSS}_I^K \cup \mathsf{MSS}_I^{\overline{K}}$, where $\mathsf{MSS}_I^{\overline{K}} = \{M \in \sqcup(\{\mathsf{MSS}_{C_1}, \mathsf{MSS}_{C_2}\}) \mid \forall M' \in \mathsf{MSS}_I^K.\ M \not\subset M'\}$. Note that whereas the set $\mathsf{MSS}_I^K$ is computed via an external explicit MSS enumerator, i.e., relatively expensively, the set $\mathsf{MSS}_I^{\overline{K}}$ is computed via the decomposition, i.e., relatively cheaply. Consequently, we should attempt to find a decomposition cut $K$ such that $|\mathsf{MSS}_I^K|$ is relatively small (compared to $|\mathsf{MSS}_I^{\overline{K}}|$). Now, observe that since $\mathsf{MSS}_I^K$ contains the MSSes of $I$ that include at least a single clause from $K$, the smaller $|K|$ is, the smaller is the maximum possible cardinality of $\mathsf{MSS}_I^K$. Consequently, we should minimize $|K|$.

*Balance.* By Proposition 1, $|\sqcup(\{\mathsf{MSS}_{C_1}, \mathsf{MSS}_{C_2}\})| = |\mathsf{MSS}_{C_1}| \times |\mathsf{MSS}_{C_2}|$. Observe that to maximize $|\sqcup(\{\mathsf{MSS}_{C_1}, \mathsf{MSS}_{C_2}\})|$ while minimizing the number $|\mathsf{MSS}_{C_1}| + |\mathsf{MSS}_{C_2}|$ of MSSes that are needed to build $\sqcup(\{\mathsf{MSS}_{C_1}, \mathsf{MSS}_{C_2}\})$, we should ideally find a decomposition cut $K$ such that $|\mathsf{MSS}_{C_1}|$ and $|\mathsf{MSS}_{C_2}|$ are roughly equal. However, since we do not know the MSSes of $I$ in advance, we cannot (cheaply) find a decomposition cut that balances $|\mathsf{MSS}_{C_1}|$ and $|\mathsf{MSS}_{C_2}|$. Instead, we just try to find a decomposition cut such that $|C_1|$ and $|C_2|$ are roughly equal (and thus the maximal possible number of MSSes in $C_1$ and $C_2$ is roughly equal).
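The preference for balanced components follows from a standard inequality. Writing $a$ and $b$ for the two MSS counts and $s = a + b$ for the number of explicitly identified MSSes (these symbols are introduced here only for illustration):

$$a \cdot b = \frac{s^2 - (a - b)^2}{4} \le \frac{s^2}{4}, \quad \text{with equality iff } a = b.$$

For example, with $s = 20$ explicit MSSes, a balanced split gives $10 \times 10 = 100$ composed MSSes, whereas the split $2 + 18$ gives only $2 \times 18 = 36$.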

*Necessity.* Note that in order to ensure that $|\sqcup(\{\mathsf{MSS}_{C_1}, \mathsf{MSS}_{C_2}\})| > |\mathsf{MSS}_{C_1}| + |\mathsf{MSS}_{C_2}|$, it has to hold that $|\mathsf{MSS}_{C_1}| > 1$ and $|\mathsf{MSS}_{C_2}| > 1$. Furthermore, observe that:

Observation 2. *Given a formula* $X$*, it holds that* $|\mathsf{MSS}_X| > 1$ *iff* $X$ *is unsatisfiable.*

Therefore, for a suitable decomposition cut $K$, both components $C_1$ and $C_2$ should be unsatisfiable. All three conditions above can be straightforwardly generalized to a cut $K$ that yields more than two components.

To find a decomposition cut $K$ with the above three properties, we build a *weighted partial MaxSAT (WPM)* [34] instance and solve it with a MaxSAT solver. In WPM, we are given a tuple $(H, S, w : S \to \mathbb{N}^+)$, where $H$ is a set of *hard clauses*, $S$ is a set of *soft clauses*, and $w$ is a weight function that assigns to every soft clause a positive weight. A *solution* of the WPM is a valuation $\pi$ of $\mathsf{Vars}(H \cup S)$ such that $\pi$ satisfies all hard clauses and maximizes the sum of the weights of satisfied soft clauses.

In our case, we build $H \cup S$ using two sets of Boolean variables: $P = \{p_1, \dots, p_{|I|}\}$ and $Q = \{q_1, \dots, q_{|I|}\}$. Note that every valuation $\pi$ of $P \cup Q$ corresponds to the subsets $\pi_{P,I}$ and $\pi_{Q,I}$ of $I$ defined as $\pi_{P,I} = \{c_i \in I \mid \pi(p_i) = 1\}$ and $\pi_{Q,I} = \{c_i \in I \mid \pi(q_i) = 1\}$. Furthermore, we write $\pi_K$ to denote the set $I \setminus (\pi_{P,I} \cup \pi_{Q,I})$. We define a WPM instance $(H, S, w : S \to \mathbb{N}^+)$ in such a way that for every one of its solutions $\pi$ it holds that: 1) $\pi_K$ is a decomposition cut for $I$, and 2) the clauses in $\pi_{P,I}$ and $\pi_{Q,I}$ are disconnected in $G(I \setminus \pi_K)$, i.e., they *witness* that $\pi_K$ is a decomposition cut for $I$. To ease the presentation, we express $H$ and $S$ below as plain propositional formulas using the standard Boolean connectives of conjunction ($\wedge$), disjunction ($\vee$) and implication ($\to$). One can use the Tseitin transformation to convert the formulas to sets of clauses.

The formula (hard clauses) $H$ is divided into three subformulas, $H = \mathsf{cut} \wedge \mathsf{unsat} \wedge \mathsf{minimal}$. The formula $\mathsf{cut}$ (Equation 1) expresses that $\pi_K$ is a decomposition cut, and encodes this property via two sub-formulas: $\mathsf{disj}$ and $\mathsf{discn}$. The formula $\mathsf{disj}$ expresses that $\pi_{P,I} \cap \pi_{Q,I} = \emptyset$, whereas $\mathsf{discn}$ encodes that there are no two clauses $c_i \in \pi_{P,I}$ and $c_j \in \pi_{Q,I}$ such that there exists a literal $l \in c_i$ with $\neg l \in c_j$ (i.e., that $c_i$ and $c_j$ are connected in $G(I \setminus \pi_K)$). Consequently, the clauses from $\pi_{P,I}$ and $\pi_{Q,I}$ do not belong to the same component of $G(I \setminus \pi_K)$, and hence, by Definition 7, $\pi_K$ is a decomposition cut for $I$. Note that $\mathsf{cut}$ does not enforce that $|D(I \setminus \pi_K)| = 2$, i.e., $\pi_{Q,I}$ and/or $\pi_{P,I}$ can be fragmented into multiple components in $D(I \setminus \pi_K)$.

$$\begin{aligned} \mathsf{cut} &= \mathsf{disj} \land \mathsf{discn}, \text{ where} \\ \mathsf{disj} &= \bigwedge_{c_i \in I} (\neg p_i \lor \neg q_i), \text{ and} \\ \mathsf{discn} &= \bigwedge_{c_i \in I} \left( \bigwedge_{l \in c_i} \left( \bigwedge_{c_j \in \{c_j \in I \mid \neg l \in c_j\}} (\neg p_i \lor \neg q_j) \right) \right) \end{aligned} \tag{1}$$
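Since $\mathsf{disj}$ and $\mathsf{discn}$ are already conjunctions of binary clauses, Equation 1 translates directly into CNF. The sketch below is illustrative only: the function name `cut_clauses` and the variable numbering ($p_i = i + 1$, $q_i = |I| + i + 1$ for 0-based $i$) are assumptions, and clauses of $I$ are given as sets of non-zero integers.

```python
def cut_clauses(I):
    """CNF clauses of cut = disj ∧ discn for a list of clauses I.

    Clause c_i (0-based index i) gets the variables p_i = i + 1 and
    q_i = len(I) + i + 1; this numbering is an arbitrary choice.
    """
    n = len(I)
    p = [i + 1 for i in range(n)]
    q = [n + i + 1 for i in range(n)]
    # disj: no clause is placed on both sides.
    disj = [[-p[i], -q[i]] for i in range(n)]
    # discn: clauses with complementary literals never sit on opposite sides.
    discn = [[-p[i], -q[j]]
             for i in range(n)
             for lit in I[i]
             for j in range(n) if -lit in I[j]]
    return disj + discn
```

As in Equation 1, a pair of clauses connected through several literals produces the same binary clause more than once; the duplicates are harmless and could be removed if desired.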

The formula $\mathsf{unsat}$ (Equation 2) attempts to encode that both $\pi_{P,I}$ and $\pi_{Q,I}$ are unsatisfiable, i.e., to fulfil the Necessity condition. To ensure this property, we first attempt to identify a pair of disjoint MUSes of $I$, denoted by $M_1$ and $M_2$. Equation 2 expresses that $\pi_{P,I} \supseteq M_1$ and $\pi_{Q,I} \supseteq M_2$, and hence $\pi_{P,I}$ and $\pi_{Q,I}$ are unsatisfiable. To find $M_1$ and $M_2$, we enumerate a sequence $X_1, X_2, \dots$ of MUSes of $I$ using an MUS enumerator, and for each MUS $X_z$ we check whether $I \setminus X_z$ is unsatisfiable. If there is such an MUS $X_z$, we use $X_z$ as $M_1$, and we *shrink* $I \setminus X_z$ to the MUS $M_2$ via a single MUS extractor. We enumerate only a subset of MUSes of $I$ (limited via a user-definable time limit), and hence, we might fail to identify disjoint MUSes even if there are some. Also, it might be the case that $I$ does not contain disjoint MUSes. In such cases, we set $\mathsf{unsat}$ to 1 (*True*), i.e., we do not ensure satisfaction of the Necessity condition.

$$\mathsf{unsat} = \Big(\bigwedge_{c_i \in M_1} p_i\Big) \land \Big(\bigwedge_{c_i \in M_2} q_i\Big) \tag{2}$$

The formula $\mathsf{minimal}$ (Equation 3) targets the Minimality condition. We express that for every $c \in \pi_K$ the set $\pi_K \setminus \{c\}$ is not a decomposition cut for $I$. Note that this is minimality in the subset-inclusion sense, not in the cardinality sense. The formula states that every clause $c \in \pi_K$ is connected (in $G(I \setminus \pi_K)$) to a clause in $\pi_{P,I}$ and to a clause in $\pi_{Q,I}$. Consequently, adding $c$ to $\pi_{P,I}$ ($\pi_{Q,I}$), i.e., flipping the assignment $\pi(p_i)$ ($\pi(q_i)$) to 1, would violate the formula $\mathsf{discn}$.

$$\begin{aligned} \mathsf{minimal} &= \bigwedge_{c_i \in I} \Bigg( (\neg p_i \land \neg q_i) \to \\ & \qquad \bigg( \Big(\bigvee_{l \in c_i} \bigvee_{c_j \in \{c_j \in I \mid \neg l \in c_j\}} p_j\Big) \land \Big(\bigvee_{l \in c_i} \bigvee_{c_j \in \{c_j \in I \mid \neg l \in c_j\}} q_j\Big) \bigg) \Bigg) \end{aligned} \tag{3}$$

Finally, the *soft formula* (clauses) $S = S_1 \wedge S_2$ is divided into two sub-formulas. $S_1$ (Equation 4) expresses that every $c \in I$ belongs either to $\pi_{P,I}$ or to $\pi_{Q,I}$, i.e., that $\pi_K$ is empty. The weight assigned to the clauses of $S_1$ is $3 \cdot |I|$, which ensures that every solution $\pi$ of the WPM minimizes $|\pi_K|$. Hence, $S_1$ further strengthens the Minimality condition. $S_2$ (Equation 5) attempts to fulfil the Balance condition. In particular, for every $c_i \in I$, we add two soft clauses, $p_i$ and $q_i$, and with an equal probability (0.5) we randomly set the weights $w(p_i) = 1$ and $w(q_i) = 2$ or vice versa. Intuitively, the formula $\mathsf{disj}$ enforces that at most one of $p_i$ and $q_i$ holds, and the weights for $S_2$ attempt to randomly *push* $c_i$ either towards $\pi_{P,I}$ or $\pi_{Q,I}$.


$$S_1 = \bigwedge_{c_i \in I} (p_i \lor q_i) \tag{4}$$

$$S_2 = \Big(\bigwedge_{c_i \in I} p_i\Big) \land \Big(\bigwedge_{c_i \in I} q_i\Big) \tag{5}$$
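The soft part of the instance can be generated in the same style. The following sketch assumes the same illustrative variable numbering ($p_i = i + 1$, $q_i = |I| + i + 1$ for 0-based $i$); the function name `soft_clauses` and the `seed` parameter are hypothetical and serve only to make the random choice reproducible.

```python
import random

def soft_clauses(n, seed=0):
    """Weighted soft clauses S1 and S2 for a formula I with n clauses.

    Uses p_i = i + 1 and q_i = n + i + 1 (0-based i), an assumed numbering.
    Returns a list of (clause, weight) pairs.
    """
    rng = random.Random(seed)
    soft = []
    for i in range(n):
        p_i, q_i = i + 1, n + i + 1
        soft.append(([p_i, q_i], 3 * n))         # S1: keep c_i out of the cut
        w_p, w_q = rng.choice([(1, 2), (2, 1)])  # S2: random push to P or Q
        soft.append(([p_i], w_p))
        soft.append(([q_i], w_q))
    return soft
```

Because $\mathsf{disj}$ allows at most one of $p_i$ and $q_i$ to hold, the $S_2$ clauses can contribute at most $2 \cdot |I|$ in total, which is below the weight $3 \cdot |I|$ of a single $S_1$ clause; this is why the $S_1$ clauses dominate.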

Finally, let us note that even if by solving the WPM we obtain a decomposition cut $K$ such that $|\sqcup(\{\mathsf{MSS}_C \mid C \in D(I \setminus K)\})|$ is very large, there is no guarantee that $|\{M \in \sqcup(\{\mathsf{MSS}_C \mid C \in D(I \setminus K)\}) \mid \forall M' \in \mathsf{MSS}_I^K.\ M \not\subset M'\}| > 0$, i.e., the decomposition might not be helpful. Therefore, the three conditions on finding a suitable decomposition cut should be seen as heuristics.

### *D. Towards Partial MSS Enumeration*

A few words are in order concerning the practical tractability of running Algorithm 2. As discussed above, the lean kernel $I$ of the input formula $N$ can possibly contain exponentially many MSSes. Hence the MSS enumeration might be beyond the reach of contemporary MSS enumerators (which usually perform around 1-5 SAT solver calls per MSS [8]). To cope with this intractability, we decompose $I$ into several components, and we hope that the MSS count for the individual components will be relatively small and thus tractable for a contemporary MSS enumerator. However, note that if there is a component which is still intractable for a contemporary enumerator (calls of getMSSes(...), lines 3 and 7), then Algorithm 2 does not terminate in a reasonable time.

Here, we propose a slight modification of Algorithm 2 that deals with such an intractability. When running getMSSes($A$, $B$), we instruct the underlying MSS enumerator to return at most $k$ MSSes of $A$, where $k$ can be specified by the user of our algorithm. Consequently, if $k$ is reasonably small, the calls of getMSSes($A$, $B$) become tractable and Algorithm 2 terminates. After such a modification, the sets $\mathsf{MSS}_I^K$ and IKMSSparts might be incomplete, and thus the set $\mathsf{MSS}_I$ formed on line 8 can also be incomplete (and hence also the overall set of MSSes returned by Algorithm 1). However, besides the incompleteness, the set $\mathsf{MSS}_I$ might not be sound, i.e., it can contain elements that are not MSSes of $I$.

In particular, we add to $\mathsf{MSS}_I$ every $M \in \sqcup(\mathsf{IKMSSparts})$ such that $\forall M' \in \mathsf{MSS}_I^K.\ M \not\subset M'$. Provided that $\mathsf{MSS}_I^K$ is complete, passing the check $\forall M' \in \mathsf{MSS}_I^K.\ M \not\subset M'$ ensures that $M$ is an MSS of $I$ (Proposition 2). However, if $\mathsf{MSS}_I^K$ is incomplete, then 1) every $M$ that *does not pass* the check *is not* an MSS of $I$, and 2) every $M$ that *does pass* the check *may or may not be* an MSS of $I$. Thus, in the case when $\mathsf{MSS}_I^K$ is incomplete, we first check for every $M$ whether it satisfies $\forall M' \in \mathsf{MSS}_I^K.\ M \not\subset M'$, and if it does, we also verify that $M$ is an MSS of $I$ using a SAT solver. Such a verification can be performed using a single call of a SAT solver [14] (we check whether $M \wedge (\bigvee_{c \in I \setminus M} c)$ is satisfiable).
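The single-call verification can be sketched as follows. To keep the example self-contained, a brute-force check (`is_sat`, a hypothetical helper) stands in for the SAT solver; it is usable only on tiny formulas. The function `is_mss` assumes that $M$ itself is already known to be satisfiable, as is the case for the composed candidates.

```python
from itertools import product

def is_sat(clauses):
    """Brute-force SAT check standing in for a real SAT solver."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for bits in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, bits))
        if all(any(model[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

def is_mss(M, I):
    """Check maximality of a satisfiable M (list of clauses) within I.

    The disjunction of all clauses outside M is itself one big clause, so
    the test "M plus that clause is satisfiable" is a single SAT query.
    """
    rest = [clause for clause in I if clause not in M]
    big_clause = set().union(*rest)   # empty clause when M equals I
    return not is_sat(list(M) + [big_clause])
```

If the query is satisfiable, some clause outside $M$ can be added without losing satisfiability, so $M$ is not maximal.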

#### VI. EXPERIMENTAL EVALUATION

We have implemented our novel approach for MSS/MCS enumeration in a Python-based tool using the MSS enumerator RIME [11] to implement the procedure getMSSes, the library PySAT [27] for maintaining CNF formulas, Minisat [19] (accessed via PySAT) as a SAT solver, and UWrMaxSat [45] as a MaxSAT solver. The tool is available at:

#### https://github.com/jar-ben/MSSDecomposition

Here we provide results of our experimental evaluation. We write DecExact to denote the *complete* MSS enumeration approach as described in Algorithms 1 and 2, and DecApprox to denote the *partial* MSS enumeration version as described in Section V-D. For DecApprox, we set the parameter k to 100000, i.e., every call of getMSSes identifies at most 100000 MSSes. Moreover, we evaluate three contemporary MSS/MCS enumeration algorithms: MARCO<sup>3</sup> [36], FLINT<sup>4</sup> [44], and RIME<sup>5</sup> [11]. In all cases, we used the original implementations of the algorithms with their best (default) settings.

As benchmarks, we used a collection of 1491 Boolean CNF formulas that were used in several recent MSS or MUS related studies. Out of the 1491 formulas, 1200 instances<sup>6</sup> are randomly generated formulas that were first used in [38], and the remaining 291 benchmarks were taken from the MUS track of the SAT Competition 2021<sup>7</sup>. The former benchmarks contain from 100 to 1000 clauses, use from 50 to 996 variables, and have from 2 to at least $10^{22}$ MSSes (the highest MSS count revealed in our evaluation). The latter benchmarks contain from 70 to 16 million clauses, use from 26 to 4.4 million variables, and have from 2 to at least $10^{8}$ MSSes. We ran all experiments on a machine with an AMD EPYC 7371 16-Core Processor and 1 TB of memory, running Debian Linux. We used a 20 GB memory limit and a 3600-second (1 hour) time limit per benchmark.

#### *A. Research Questions*

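We focus on answering the following research questions.

- RQ1: How many benchmarks can the individual algorithms solve within the time limit?
- RQ2: How do the algorithms scale w.r.t. the number of MSSes in the input formula?
- RQ3: How many MSSes do DecExact and DecApprox need to identify explicitly?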

#### *B. RQ1: Number of Solved Benchmarks*

In Figure 2, we show the number of benchmarks for which individual algorithms finished their computation (within the time limit). In particular, a point with coordinates $[x, y]$ means that there are $x$ benchmarks that were finished by the algorithm in at most $y$ seconds. FLINT, RIME, and MARCO were able to identify all MSSes *only* for 364, 376, and 415 benchmarks, respectively. On the other hand, DecExact identified all MSSes for 788 benchmarks, i.e., roughly twice as many as its competitors. Finally, DecApprox finished the computation for 1240 benchmarks; however, in many cases, it identified only a portion of all MSSes (due to the limit of 100000 MSSes per getMSSes call). In particular, DecApprox identified all MSSes for 742 benchmarks, and at least some MSSes for 498 benchmarks.

<sup>7</sup>http://www.satcompetition.org/

Fig. 2: Number of solved benchmarks.

Fig. 3: Scalability w.r.t. the MSS Count

We observed that the tractability of the benchmarks highly correlates with their size (number of clauses). In particular, there are only 16 benchmarks that contain more than 10000 clauses and were solved by at least one of the tools (excluding the incomplete tool DecApprox). Moreover, FLINT, RIME, and MARCO scale better w.r.t. this criterion than DecExact since there are 10 benchmarks that contain more than 500000 clauses (but only up to 20000 MSSes) and were solved by these tools. On the other hand, the largest benchmark solved by DecExact contains only 13236 clauses. We further discuss this bottleneck of our approach in Section VII.

#### *C. RQ2: Scalability W.R.T. the MSS Count*

In Figure 3, we compare the scalability of the evaluated algorithms w.r.t. the number of MSSes in the input formulas. In particular, a point with coordinates $[x, y]$ denotes that there are $x$ benchmarks where the corresponding algorithm identified fewer than $y$ MSSes. We can see that MARCO and RIME were able to identify at most around $10^{6}$ MSSes. FLINT performed slightly better w.r.t. this criterion since for some benchmarks, it identified around $10^{8}$ MSSes. In contrast, both DecExact and DecApprox were able to identify up to $10^{22}$ MSSes in a benchmark. This shows that the use of our MSS decomposition techniques allows us to substantially improve the scalability of existing approaches.

<sup>3</sup>https://sun.iwu.edu/~mliffito/marco/

<sup>4</sup>The implementation of FLINT was kindly provided to us by its author, Nina Narodytska.

<sup>5</sup>https://github.com/jar-ben/rime

<sup>6</sup>https://github.com/luojie-sklsde/MUS_Random_Benchmarks

Fig. 4: The ratio between the total number of MSSes and the number of explicitly identified MSSes.

#### *D. RQ3: Number of Explicitly Identified MSSes*

Finally, the third research question concerns just our two algorithms, DecExact and DecApprox. Given a formula $F$, we examine the ratio $\frac{tc}{ex}$, where $tc$ is the total number of identified MSSes of $F$ (i.e., $|\mathsf{MSS}_F|$ and an under-approximation of $|\mathsf{MSS}_F|$ for DecExact and DecApprox, respectively) and $ex$ is the number of MSSes identified via the calls of getMSSes. A point with coordinates $[x, y]$ in Figure 4 denotes that for the corresponding algorithm, there are $x$ benchmarks where the ratio was at least $y$. Note that we show the ratio only for the 788 and 1240 benchmarks where DecExact and DecApprox, respectively, finished the computation.

Recall that getMSSes is implemented via an *explicit MSS enumerator*, i.e., it identifies individual MSSes one by one using a sequence of SAT solver calls; identification of these MSSes is thus the most expensive part of our algorithm(s). On the other hand, the $tc$ MSSes are identified extremely cheaply since they are built by just composing the MSSes identified via getMSSes. Therefore, the ratio $\frac{tc}{ex}$ represents the (maximum possible) speed-up of the MSS enumeration when using DecExact and DecApprox compared to using the *explicit enumerators* FLINT, MARCO, and RIME.

#### VII. LIMITATIONS AND PRACTICAL APPLICABILITY

Even though our novel approaches, DecExact and DecApprox, solved substantially more benchmarks in our evaluation than contemporary MSS enumerators, the practical efficiency of our approaches remains unclear. Here, we discuss two main bottlenecks of our approaches and propose ways to deal with them.

The first bottleneck of our MSS decomposition technique is its reliance on a MaxSAT solver (which is used to find a suitable cut). The size of the formula $\mathsf{cut}$ (Equation 1) depends on the number $|F|$ of clauses in the input formula $F$. Hence, for larger input formulas $F$, solving the MaxSAT problem for $\mathsf{cut}$ easily becomes practically intractable. A possible way to deal with this limitation is to use just an *approximate* MaxSAT solver. In particular, recall that our approach for finding a suitable cut via the formula $\mathsf{cut}$ is just a heuristic, i.e., there is no guarantee that it will indeed find a suitable cut. Using an approximate MaxSAT solver instead of an exact one might increase the scalability of our approach w.r.t. $|F|$.

The second bottleneck of our MSS decomposition technique was stated in Observation 2. In particular, recall there exists a usable cut for a given formula F only if F contains a disjoint pair of MUSes. Based on our empirical experience, there are many applications where the input formula does not contain a disjoint pair of MUSes and hence our approach cannot be applied. Yet, we have also witnessed many industrial benchmarks where disjoint MUSes naturally appear (for instance, there is a SAT encoding of the graph coloring problem where disjoint MUSes correspond to disjoint non-colorable subgraphs). Hence, one might initially check whether the input formula F contains disjoint MUSes and employ our approach only if it is the case.

#### VIII. CONCLUSION AND FUTURE WORK

In this paper, we focused on the problem of enumerating Maximal Satisfiable Subsets of a given CNF formula F. Despite the fact that the enumeration problem has been extensively studied in the past decades, contemporary enumerators are still often unable to finish the computation within a reasonable time limit. The problem is that there can be up to exponentially many MSSes w.r.t. |F| and contemporary approaches usually need to perform a sequence of SAT solver queries to obtain individual MSSes. To combat the combinatorial explosion, we proposed a novel MSS enumeration approach that decomposes F into several smaller sub-formulas, identifies their MSSes, and then composes the MSSes of the sub-formulas to form MSSes of the whole F. Our experimental evaluation showed that the decomposition in some cases allows us to identify exponentially more MSSes than other contemporary approaches. Yet, as described in Section VII, the class of benchmarks where our approach can be applied is limited.

We see several directions for future work. A crucial ingredient of our algorithm is the ability to identify a suitable decomposition cut K. The approach for finding K that we proposed works well in our evaluation, i.e., it indeed allows for a decomposition. However, we believe that there might be even better approaches to finding a suitable decomposition cut. Another direction for future work would be to improve upon the partial MSS enumeration approach (DecApprox). In particular, instead of limiting the number of MSSes returned by getMSSes, one might try to either interleave or parallelize the computation of MSSes of individual components and compose the MSSes on-the-fly. Finally, since our approach is applicable only to a specific class of benchmarks, it might be worth building a portfolio approach.

#### ACKNOWLEDGEMENT

This research was funded in part by the Deutsche Forschungsgemeinschaft project 389792660-TRR 248 and by the European Research Council under the Grant Agreement 610150 (ERC Synergy Grant ImPACT).

# Designing Samplers is Easy: The Boon of Testers

Priyanka Golia *Indian Institute of Technology Kanpur; National University of Singapore*

Mate Soos *National University of Singapore*

Sourav Chakraborty *Indian Statistical Institute, Kolkata*

Kuldeep S. Meel *National University of Singapore*

*Abstract*—Given a formula ϕ, the problem of uniform sampling seeks to sample solutions of ϕ uniformly at random. Uniform sampling is a fundamental problem with a wide variety of applications. The computational intractability of uniform sampling has led to the development of several samplers that heavily rely on heuristics and are not accompanied by theoretical analysis of their distribution. Recently, Chakraborty and Meel (2019) designed the first scalable sampling tester, Barbarik, based on a grey-box sampling technique for testing if the distribution, according to which the given sampler is sampling, is close to the uniform or far from uniform. While the theoretical analysis of Barbarik provides only unconditional soundness guarantees, the empirical evaluation of Barbarik did show its success in determining that some of the off-the-shelf samplers were far from a uniform sampler.

The availability of Barbarik has the potential to spur the development of sampling techniques such that developers can design sampling methods that are accepted by Barbarik even though these samplers may not be amenable to a detailed mathematical analysis. In this paper, we present the realization of this promise. Based on the flexibility offered by CryptoMiniSat, we design a sampler, CMSGen, that aims to achieve a sweet spot between the quality of distributions and runtime performance. In particular, CMSGen achieves significant runtime performance improvement over the existing samplers. We conduct two case studies, and demonstrate that the usage of CMSGen leads to significant runtime improvements in the context of combinatorial testing and functional synthesis.

A salient strength of our work is the simplicity of CMSGen, which stands in contrast to complicated algorithmic schemes developed in the past that fail to attain the desired quality of distributions with practical runtime performance.

#### I. INTRODUCTION

Given a formula ϕ, the problem of uniform sampling seeks to sample solutions of ϕ uniformly at random. Uniform sampling has emerged as an essential technique in the context of constrained-random simulation [33], constraint-based fuzzing [5], [19], [22], configuration testing [13], [23], bug synthesis [36], and the like. For example, in the context of constrained-random simulation, uniform sampling is employed to generate test cases that satisfy the set of constraints encoding domain knowledge from sources such as designers, end users, and the like.

The widespread applications of uniform sampling have led to several algorithmic proposals over the years with varying theoretical guarantees and empirical scalability. Chakraborty, Meel, and Vardi introduced the first practical almost-uniform sampler, UniGen [11], [12], which has since been improved to UniGen3 [9], [39]. Recently, Sharma et al. proposed a knowledge compilation-based approach [37], called KUS, that can perform uniform sampling. While UniGen3 and KUS can scale to hundreds of thousands of variables for some problems, their performance still falls short of the desired scale for some real-world instances. The need for scalability has led to the development of several tools that seek to achieve scalability at the cost of theoretical guarantees. The underlying techniques for such tools cover a broad spectrum, including adapted BDD-based techniques [26], random seeding of DPLL-based SAT solvers [32], Markov Chain Monte Carlo-based (MCMC) methods [24], [43], interval propagation and belief network-based methods [14], [20], and MaxSAT-based techniques [16].

The lack of guarantees for various samplers leads their designers to illustrate the quality of the generated samples by computing statistics of the generated distributions over a small set of benchmarks. Such demonstrations, however, do not generalize to many classes of benchmarks, and it is often the case that subsequent studies demonstrate cases where previously proposed samplers generate distributions far away from uniform. While the theoretical guarantees of uniformity can be viewed as a holy grail, much of the progress in software engineering is owed to the development of testing methodologies. These methodologies are employed both by the developers themselves, to validate the system and find bugs in the form of test-driven development (TDD), and to build trust with the end-users; all without requiring the developers to supply a formal proof of correctness.

A major contributing factor to the dramatic improvement in the robustness and scalability of SAT solvers has been the development of the DRAT proof format and the associated proof checker drat-trim [44]. The availability of drat-trim allows SAT solver developers to find bugs that would otherwise be hard to discover owing to the complex architecture of state-of-the-art SAT solvers. While the problem of checking whether a given formula is UNSAT is *merely* in Co-NP, the problem of testing whether a sampler is uniform requires Ω(2<sup>n</sup>) samples given black-box access to the sampler [3], [8], where n is the number of variables.

Recently, Chakraborty and Meel proposed the first scalable sampler test framework, Barbarik [8]. This framework distinguishes whether the distribution generated by the given sampler is ε-close to uniform (Accept) or η-far from uniform (Reject), while the number of samples required depends only on ε and η, and is independent of n. The core idea of Barbarik is to reduce testing of uniformity over the entire solution space of ϕ to the testing of uniformity over the solution space of another formula, ϕˆ, constructed over two randomly chosen solutions of ϕ (observe that ϕˆ → ϕ). The subroutine to construct ϕˆ is called Kernel. The analysis of Barbarik states that if Barbarik Rejects a sampler, the distribution generated by the sampler is indeed (probabilistically) far from uniform, but if Barbarik Accepts a sampler, the sampler's distribution is close to uniform under the *non-adversarial* assumption with respect to Kernel. Informally, the *non-adversarial* assumption with respect to Kernel dictates that, given ϕ, the conditional distribution of the sampler over the solutions of ϕˆ is the same as the distribution of the sampler with ϕˆ as input. Note that this allows some samplers to behave in an *adversarial* manner, i.e., such samplers may not generate a uniform distribution over ϕ, yet may generate uniform distributions for ϕˆ. In such a case, Barbarik will return Accept for such samplers. At this point, it is worth remarking that given the strong lower bounds on black-box testing, the usage of such an assumption is a *practical* necessity.

Empirically, Barbarik was able to return Reject for all the state of the art samplers without rigorous mathematical analysis certifying (almost)-uniformity of the generated distributions. In particular, Barbarik was demonstrated to Accept UniGen3 while rejecting the state of the art samplers STS [18] and QuickSampler [16]. It is worth noting that the three samplers, UniGen3, QuickSampler, and STS, were found to be statistically indistinguishable by the usage of simple metrics such as KL-divergence [27] after a small number of samples.

The availability of Barbarik, however, has the potential to allow the development of samplers whose algorithmic frameworks may not be amenable to mathematical analysis but which can be accepted by Barbarik. The primary contribution of this paper is the realization of the promise of Barbarik via the development of a new state of the art sampler, CMSGen. In particular, we make the following contributions:

### *A.* CMSGen*: A State of the Art Sampler*


significantly improves upon UniGen3 in terms of runtime performance.

### *B. Case Studies: Combinatorial Testing and Functional Synthesis*

3) At this point, one may wonder whether there are practical applications of CMSGen. We next focus on applications that are beyond the reach of UniGen3, and for which one has to rely on heuristics-based samplers. In particular, we perform two case studies: (1) combinatorial testing and (2) functional synthesis, two problems with a long history of sustained interest in the formal methods and software engineering communities. For both case studies, we observe that the usage of CMSGen leads to significant performance improvements in comparison to the usage of the competing samplers UniGen3 and QuickSampler.

It is worth remarking that a salient strength of CMSGen is the simplicity of its design. We find it exciting that a sampler with such a simple design could outperform sophisticated state of the art samplers. Based on our empirical analysis, one would remark that CMSGen aims to achieve the sweet spot of scalability and uniformity. In particular, CMSGen is significantly more scalable than samplers with guarantees and, at the same time, achieves distributions of higher quality than samplers without guarantees. The runtime performance combined with the quality of distribution as certified by Barbarik makes CMSGen the ideal choice for applications such as combinatorial testing and functional synthesis where scalability and quality of distribution are equally crucial.

The rest of the paper is organized as follows. In Section II, we present the formal definitions and a brief description of the sampler verifier Barbarik. In Section III, we present the new sampler CMSGen, and in Section IV, we evaluate CMSGen both by comparing its runtime performance with other samplers and by testing it with Barbarik. Then, in Section V, we demonstrate the usefulness of CMSGen with two case studies on problems of fundamental importance to the formal methods community: functional synthesis and combinatorial testing. Finally, we conclude in Section VI.

#### II. NOTATION AND BACKGROUND

A literal is a Boolean variable or its negation. Let ϕ be a Boolean formula in conjunctive normal form (CNF), and let X be the set of variables appearing in ϕ. The set X is called the *support* of ϕ, denoted by Supp(ϕ). Given an array a, a[i : j] represents the sub-array consisting of all the elements of a between indices i and j. A *satisfying assignment* or *witness*, denoted by σ, is an assignment of truth values to the variables in the support of ϕ such that ϕ evaluates to true. A satisfying assignment is also represented as a set of literals. For S ⊆ Supp(ϕ), we use σ<sub>↓S</sub> to indicate the projection of σ over the set of variables S. We denote the set of all witnesses of ϕ as sol(ϕ). For notational convenience, whenever the formula ϕ and/or the set S ⊆ Supp(ϕ) is clear from the context, we omit mentioning them.

<sup>1</sup>Available at https://github.com/msoos/cryptominisat

### *A. Samplers*

*Definition 1:* Given a Boolean formula ϕ, a *CNF-sampler* (or simply *sampler*) G of ϕ is a probabilistic algorithm that generates a random element in sol(ϕ). We will assume that a sampler takes as input a CNF-formula ϕ, a set S ⊆ Supp(ϕ) and an integer k. It generates k elements σ<sub>1</sub>, …, σ<sub>k</sub> from sol(ϕ) and outputs σ<sub>1↓S</sub>, …, σ<sub>k↓S</sub>. When the integer k and the set S ⊆ Supp(ϕ) are clear from the context (or are not important), we will drop them and use G(ϕ) or G(ϕ, S) to denote the sampler.

We use p<sub>G</sub>(ϕ, σ) (or p<sub>G</sub>(ϕ, σ, S)) to denote the probability that G(ϕ, ·, ·) (or G(ϕ, S, ·)) generates σ (or σ<sub>↓S</sub>). We use D<sub>G(ϕ)</sub> (and D<sub>G(ϕ,S)</sub>) to denote the distribution induced by G over the set sol(ϕ) (and sol(ϕ)<sub>↓S</sub>). For a set T ⊆ sol(ϕ), we use D<sub>G(ϕ)</sub>↓T to denote the distribution D<sub>G(ϕ)</sub> conditioned on the set T.

*Definition 2:* Given a Boolean formula ϕ, a *uniform sampler* G<sup>u</sup>(ϕ) is a sampler that, given ϕ, guarantees

$$\forall y \in sol(\varphi),\ \Pr\left[\mathcal{G}^u(\varphi) = y\right] = 1/|sol(\varphi)|. \tag{1}$$

*Definition 3:* Given a Boolean formula ϕ and tolerance parameter ε, G<sup>AAU</sup>(ϕ, ε) is an additive *almost-uniform generator* (AAU) if the following holds:

$$\forall y \in sol(\varphi), \frac{1-\varepsilon}{|sol(\varphi)|} \le \Pr\left[\mathcal{G}^{AAU}(\varphi, \varepsilon) = y\right] \le \frac{1+\varepsilon}{|sol(\varphi)|}\tag{2}$$

A sampler is allowed to occasionally "fail" in the sense that no element may be returned even if sol(ϕ) is non-empty. The failure probability for such generators must be bounded by a constant strictly less than 1.

*Definition 4:* Given a Boolean formula ϕ and an intolerance parameter η, a generator G(ϕ, ·) is an η-far from uniform generator if the ℓ<sub>1</sub>-distance (or, twice the variation distance) of D<sub>G(ϕ)</sub> from uniform is at least η. That is,

$$\sum_{x \in sol(\varphi)} \left| p_{\mathcal{G}}(\varphi, x) - \frac{1}{|sol(\varphi)|} \right| \ge \eta$$
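To make Definitions 3 and 4 concrete, the following minimal sketch checks both conditions for an explicitly given distribution over sol(ϕ). The function names and the toy distribution are illustrative assumptions and are not part of any tool discussed here; exact rationals are used to avoid rounding issues.

```python
from fractions import Fraction

def l1_distance_from_uniform(probs):
    """Sum over all solutions of |p(x) - 1/|sol||, as in Definition 4.

    `probs` maps every solution (including zero-probability ones) to its
    probability under the sampler.
    """
    uniform = Fraction(1, len(probs))
    return sum(abs(p - uniform) for p in probs.values())

def is_additive_almost_uniform(probs, eps):
    """Check the two-sided bound of Equation (2) for every solution."""
    n = len(probs)
    low, high = (1 - eps) / n, (1 + eps) / n
    return all(low <= p <= high for p in probs.values())

# Hypothetical sampler over four solutions, skewed toward the first one.
toy = {"s1": Fraction(1, 2), "s2": Fraction(1, 4),
       "s3": Fraction(1, 8), "s4": Fraction(1, 8)}
distance = l1_distance_from_uniform(toy)  # 1/4 + 0 + 1/8 + 1/8 = 1/2
```

Under this toy distribution the sampler is η-far from uniform for every η ≤ 1/2, and it violates the bound of Equation (2) for ε = 1/2 because the first solution has probability 1/2 > 3/8.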

#### *B. Sampler Tester*

Given a sampler G, one would like to test if the sampler is indeed correct. Or in other words, one would like to test the following:


While the first point is very easy to test, testing the second point is quite challenging. Standard verification techniques or black box sampling techniques would need exponential time/samples and thus are very inefficient.

Chakraborty and Meel [8] designed the tester Barbarik that would accept if the sampler is an additive almost-uniform generator on any input and reject if the sampler is far from a uniform generator on some input under certain assumptions discussed below. The idea of Barbarik comes from the world of property testing, where the sample complexity for testing whether a distribution is uniform is studied. While it was known from classical sample complexity [3] that an exponential number of samples is required to distinguish a uniform distribution from a distribution that is η-far from uniform, in [7] it was observed that, given access to conditional samples, only a constant number of samples suffices. Conditional sampling from a distribution D means, for a subset T of the domain Ω, drawing samples from the conditional distribution D|<sub>T</sub>. The algorithm for checking whether a given distribution D over domain Ω is uniform or η-far from uniform consists of the following steps:


The last point of the above algorithm can be performed using only a constant number of conditional samples. It can also be shown that the above algorithm, with non-trivial probability, will Accept if D is uniform and Reject if D is η-far from uniform. By repeating this algorithm a certain number of times, one can boost the success probability.

While the algorithm is theoretically interesting, applying it to design a sampler test framework required crossing several hurdles. First, for Step 2 of the algorithm, one needs to run a uniform sampler. This is not too much of a hurdle, as one can use an inefficient uniform sampler, since the sampler tester is only to be used a few times to certify whether a sampler is good.

The second problem is that the algorithm, as such, could only distinguish between a uniform distribution and a distribution "far" from uniform, while a sampler tester should also Accept samplers that are "close" to uniform samplers (and not necessarily just uniform samplers).

Finally, the main concern was how to obtain conditional samples. In [8] this was achieved by constructing a new formula ϕˆ on a larger number of variables such that each satisfying assignment of ϕˆ restricted to the original set of variables is either σ<sub>1</sub> or σ<sub>2</sub>. In fact, if S = Supp(ϕ), then

$$\Pr_{\sigma \sim \mathcal{U}(sol(\hat{\varphi}))}[\sigma_{\downarrow S} = \sigma_1] = \Pr_{\sigma \sim \mathcal{U}(sol(\hat{\varphi}))}[\sigma_{\downarrow S} = \sigma_2] = \frac{1}{2}$$

where U(sol(ϕˆ)) denotes the uniform distribution over sol(ϕˆ). The new formula ϕˆ is obtained from ϕ by using a subroutine Kernel that uses the chain formula technique from [10].

The goal of the construction of ϕˆ is such that the following two conditions are satisfied:


Now, if the sampler G is an additive almost-uniform generator on any input ϕ, the first condition would be satisfied. But for the second condition to hold, an additional assumption is necessary. This assumption is called the *non-adversarial assumption* in [8].

*Definition 5:* The non-adversarial sampler assumption states that if (ϕˆ, Sˆ) is the output obtained from Kernel(ϕ, S, σ<sub>1</sub>, σ<sub>2</sub>, N) then


Thus Barbarik has the following guarantees.

*Theorem 1:* Given a sampler G, tolerance parameter ε, intolerance parameter η and correctness parameter δ,


For the implementation, the subroutine Kernel is designed in an attempt to fool the sampler into satisfying the *non-adversarial assumption*. The idea is that the new CNF-formula ϕˆ would be "hard" to distinguish from ϕ and hence one would expect

$$p_{\mathcal{G}}(\hat{\varphi}, \sigma_1, S) = \frac{p_{\mathcal{G}}(\varphi, \sigma_1, S)}{p_{\mathcal{G}}(\varphi, \sigma_1, S) + p_{\mathcal{G}}(\varphi, \sigma_2, S)}$$
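As a small worked example of the expectation above, the sketch below evaluates the right-hand side for two hypothetical samplers. The helper name and the probability values are assumptions chosen purely for illustration.

```python
from fractions import Fraction

def expected_conditional_probability(p_sigma1, p_sigma2):
    """Right-hand side of the equation above: the probability of seeing
    sigma_1 when the sampler's distribution on phi is conditioned on the
    two-element set {sigma_1, sigma_2}."""
    return p_sigma1 / (p_sigma1 + p_sigma2)

# A uniform sampler over 8 solutions behaves like a fair coin on the pair.
uniform_case = expected_conditional_probability(Fraction(1, 8), Fraction(1, 8))
# A skewed sampler favouring sigma_1 three to one behaves like a biased coin.
skewed_case = expected_conditional_probability(Fraction(3, 8), Fraction(1, 8))
```

Here `uniform_case` is 1/2 and `skewed_case` is 3/4. Under the non-adversarial assumption, a bias of the sampler on ϕ therefore shows up as a biased coin on ϕˆ, which is the quantity the tester examines.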

#### *C. Experimental Setup*

All our experiments were conducted on a high-performance computer cluster with each node consisting of an E5-2690 v3 CPU with 24 cores and 96GB of RAM, with a memory limit set to 4GB per core.

#### III. FROM CryptoMiniSat TO CMSGen

The naive technique to design a sampler is to pick a random assignment of variables, check if it satisfies the CNF formula, and, if so, output the assignment as a witness; otherwise, pick another random assignment and start over again. Using an unbiased random coin for the assignments, it is trivial to see that the technique leads to a uniform sampler. Such a proposal is, however, very inefficient since, with very high probability, a randomly picked assignment does not satisfy the formula.
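The naive technique can be sketched in a few lines. This is a minimal illustration, assuming clauses are given as lists of non-zero integers in the DIMACS convention (a positive integer v denotes variable v, a negative one its negation); the function names and the `max_tries` cut-off are hypothetical.

```python
import random

def satisfies(clauses, assignment):
    """True if every clause contains at least one literal made true."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

def rejection_sample(clauses, num_vars, rng, max_tries=100_000):
    """Draw complete assignments with fair coins until one is a witness.

    Every witness is equally likely to be returned, but the expected
    number of tries is 2**num_vars / |sol|, which is why the approach
    does not scale. Returns None if no witness is hit within max_tries.
    """
    for _ in range(max_tries):
        assignment = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
        if satisfies(clauses, assignment):
            return assignment
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3) has two witnesses.
example = [[1, 2], [-1, 3], [-2, -3]]
witness = rejection_sample(example, 3, random.Random(0))
```

On this three-variable example a quarter of all assignments are witnesses, so a few tries suffice; on formulas with sparse solution spaces the loop would almost never succeed.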

One way to make such a sampler efficient is to not start with a complete assignment but to build up a partial assignment variable by variable, set all variables that are implied by the current partial assignment, and, if a partial assignment leads to a conflict, record and learn from the failure. The concept of learning from failure is captured by the well-known conflict-driven clause-learning (CDCL) framework used by most state-of-the-art SAT solvers. We refer the reader to Chapter 4 of [4] for a detailed exposition on CDCL. In Algorithm 1, called UniformLikeWitness, we present an extension that seeks to combine the CDCL framework with randomization in the choice of partial assignments. UniformLikeWitness is essentially a randomized variation on the CDCL framework, with a randomized heuristic for what variable to assign next, a randomized heuristic for variable polarities, and without restarts.



One major problem of the above process is that the sampler, just like a SAT solver, may get stuck in a corner of the search space where there are no satisfying solutions. Once stuck, it can take much time to record the relevant conflicts before it can escape this part of the search space. In modern SAT solvers, such an escape is enabled by performing restarts. The idea of a restart is to stop the current search procedure, keeping the conflict clauses and heuristic data such as polarities and variable activities, but otherwise starting afresh, resetting the assignment state. The purpose of performing a restart is to reduce the chance of getting stuck in a non-fruitful part of the search space. Performing regular, frequent restarts is a core component of all state-of-the-art SAT solvers.

CMSGen<sup>2</sup> is a sampler that exploits the flexibility of CryptoMiniSat to implement the behaviour of UniformLikeWitness. We use a restart policy based on the number of conflicts, i.e., we perform a restart after a predetermined number of conflicts, which is set to 100. Hence, the final set of options passed to CryptoMiniSat turns off the features unrelated to CDCL (such as bounded variable elimination [17], local search [6], or symmetry breaking [15]), sets the options that control variable branching and polarity picking to match Algorithm 1, and sets the restart interval to 100. Note that while it is possible that other CDCL SAT solvers could be adjusted to generate samples as well as CMSGen does, the newer and more performant glucose-based SAT solvers [2] tend to be highly tuned without any command-line options to change or turn off heuristics.
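The combination of random branching, random polarities, propagation, and conflict-bounded restarts can be illustrated with the following toy search. It is only a sketch under strong simplifying assumptions: it is neither Algorithm 1 nor CMSGen, it performs plain backtracking without clause learning (so nothing is kept across restarts), and all names, including the `max_restarts` cut-off, are hypothetical. Clauses use the DIMACS integer convention.

```python
import random

class _Restart(Exception):
    """Raised when the conflict budget of the current run is used up."""

def _propagate(clauses, assignment):
    """Unit propagation in place; returns False on a conflict."""
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned, satisfied = [], False
            for lit in clause:
                value = assignment.get(abs(lit))
                if value is None:
                    unassigned.append(lit)
                elif value == (lit > 0):
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return False  # every literal of the clause is false
            if len(unassigned) == 1:
                assignment[abs(unassigned[0])] = unassigned[0] > 0
                changed = True
    return True

def _search(clauses, variables, assignment, rng, budget):
    if not _propagate(clauses, assignment):
        budget[0] -= 1
        if budget[0] <= 0:
            raise _Restart
        return None
    free = [v for v in variables if v not in assignment]
    if not free:
        return assignment
    var = rng.choice(free)        # random branching variable
    first = rng.random() < 0.5    # random polarity
    for value in (first, not first):
        trial = dict(assignment)
        trial[var] = value
        result = _search(clauses, variables, trial, rng, budget)
        if result is not None:
            return result
    return None

def sample_witness(clauses, num_vars, rng, restart_interval=100,
                   max_restarts=1000):
    """Return one witness, or None if none was found (e.g. UNSAT)."""
    variables = list(range(1, num_vars + 1))
    for _ in range(max_restarts):
        try:
            return _search(clauses, variables, {}, rng, [restart_interval])
        except _Restart:
            continue  # start afresh with new random choices
    return None
```

The default `restart_interval=100` mirrors the interval stated above; whether such a toy search produces near-uniform witnesses on a given formula is an empirical question that this sketch does not answer.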

We would like to emphasize that we do not claim that CMSGen generates uniform distributions over all formulas, as it is possible to construct worst-case scenarios where CMSGen would not work well. At this point, it is worthwhile to note that, to the best of our knowledge, the current techniques are insufficient to characterize the formulas for which UniformLikeWitness would behave like a uniform sampler, given their limitations in explaining the behaviour of CDCL itself. Traditionally, the proposal of a new sampler is accompanied by theoretical analysis, but in our case, we seek to rely on the testing framework of Barbarik to analyse the behavior of CMSGen.

<sup>2</sup>CMSGen is available at https://github.com/meelgroup/cmsgen

#### IV. THE POWER OF CMSGen

As mentioned above, instead of taking a conventional route focusing on the theoretical analysis of CMSGen, we seek to employ Barbarik to test whether CMSGen is a uniform sampler or not. In addition, we seek to understand the runtime behavior of CMSGen in comparison to other state of the art techniques. We conducted an extensive evaluation of diverse public domain benchmarks employed in prior studies [8], [40].

A comment on the choice of benchmarks for the two studies: For the first study, we selected the same 50 benchmarks that were employed in the evaluation of Barbarik so as to situate the results with prior context [8]. Since Barbarik needs to sample up to 1.835 × 10<sup>3</sup> solutions, the choice of benchmarks in [8] was restricted to instances for which generating samples is easy. On the other hand, these benchmarks are not meaningful for runtime performance comparison as all the tools finish on them very quickly. To this end, we relied on 70 benchmarks employed in prior sampling studies [38], [39] for runtime performance comparison.

The objective of our evaluation was two-fold:


In summary, we observe that Barbarik, somewhat surprisingly, returns Accept for CMSGen and UniGen3 on all the 50 instances while returning Reject on all the 50 instances for QuickSampler [16] and on 36 instances for STS [18], the state-of-the-art samplers without guarantees. At the same time, in a runtime comparison over 70 benchmarks arising from different application domains, we observe that CMSGen is significantly faster than UniGen3.

#### *A. Testing* CMSGen *with* Barbarik

For the experimental evaluation with Barbarik, we used the default parameters suggested by the authors. In particular, we set the tolerance parameter ε, intolerance parameter η, and confidence δ to 0.3, 1.8, and 0.1, respectively. For our chosen parameters, the number of samples required to return Accept for a given sampler under test is 1.836 × 10<sup>3</sup>, and to maintain consistency with the evaluation setup of Barbarik, we selected benchmarks (50 in total) that were used in the evaluation of QuickSampler and UniGen3 and for which Barbarik terminates within 2 hours. To test uniformity of distributions generated by CMSGen and other samplers, we employed Barbarik augmented with SPUR [1] as the underlying uniform sampler. We present the results of our evaluation in Table I, where the four columns present results corresponding to QuickSampler, STS, UniGen3, and CMSGen, respectively. The first and second rows indicate the number of instances for which Barbarik returned Accept and Reject, respectively. We first note that while Barbarik returned Reject for QuickSampler and STS on 50 and 36 instances, respectively, it returned Accept for both CMSGen and UniGen3 on all the instances. It is worth highlighting that UniGen3 provides guarantees of almost-uniformity.

TABLE I: Analysis of different samplers with Barbarik over 50 benchmarks. Parameters ε: 0.3, η: 1.8, δ: 0.1; samples required to return Accept: 1.836 × 10<sup>3</sup>.

*Remark 1:* At this point, it is worth highlighting that we arrived at the choice of parameters of CMSGen, such as when to restart via an iterative process where we would run Barbarik for the given choice of parameters and change them based on the number of instances rejected by Barbarik. In this context, it is rather encouraging that such an iterative process led us to design a sampler, CMSGen, which could not be distinguished from UniGen3 by Barbarik while significantly improving upon UniGen3 in terms of runtime performance. This highlights the advantages of a TDD-style design approach.

#### *B. Runtime Comparison*

Upon observing that Barbarik returns Accept for all the 50 instances for both CMSGen and UniGen3, a natural question is whether the runtime performance of CMSGen is comparable to that of UniGen3. To this end, we compared CMSGen with UniGen3, STS and QuickSampler on 70 benchmark instances arising from a wide range of application areas of uniform sampling, such as probabilistic reasoning and bounded model checking [37], [40]; these instances had been previously employed in empirical studies focused on the comparison of sampling techniques [38], [39].

For each of the instances, we invoke each of the samplers to generate 1000 solutions within a timeout of 7200 seconds. Figure 1 shows the cactus plot for CMSGen, UniGen3, STS and QuickSampler. We present the number of benchmarks on the x-axis and the time taken on the y-axis. A point (x, y) implies that for x benchmarks, the sampler took at most y seconds to generate 1000 solutions for each of them. With a timeout of 7200 seconds, UniGen3 and CMSGen were able to sample 1000 solutions for 51 and 52 benchmarks, respectively, whereas STS and QuickSampler generated samples for *merely* 37 and 33 instances, respectively. Figure 1 clearly shows that for all the benchmarks that were sampled 1000 times by both UniGen3 and CMSGen, CMSGen outperformed UniGen3 with a geometric speedup of over 420×.
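The following sketch shows how the points of such a cactus plot and an aggregate speedup could be derived from per-benchmark runtimes. It assumes that the reported speedup is the geometric mean of runtime ratios over the benchmarks solved by both samplers; the function names are hypothetical and the numbers are made up for illustration, not measurements.

```python
import math

def cactus_points(runtimes, timeout):
    """Return points (x, y): the x-th fastest solved benchmark took y seconds.

    `runtimes` holds one entry per benchmark; None marks an instance on
    which the sampler produced no result.
    """
    solved = sorted(t for t in runtimes if t is not None and t <= timeout)
    return [(rank, t) for rank, t in enumerate(solved, start=1)]

def geometric_mean_speedup(baseline, candidate):
    """Geometric mean of baseline/candidate over benchmarks solved by both.

    Assumes at least one benchmark was solved by both samplers.
    """
    ratios = [b / c for b, c in zip(baseline, candidate)
              if b is not None and c is not None]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Made-up runtimes in seconds for four benchmarks (not measurements).
slow = [100.0, 400.0, None, 50.0]
fast = [1.0, 1.0, 3.0, None]
points = cactus_points(fast, timeout=7200)
speedup = geometric_mean_speedup(slow, fast)
```

With these made-up numbers, `points` is `[(1, 1.0), (2, 1.0), (3, 3.0)]`, and the speedup is the geometric mean of the ratios 100 and 400, i.e. 200 up to floating-point rounding.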

Table II presents the runtime performance of QuickSampler, STS, UniGen3 and CMSGen for a representative set of 20 benchmarks. As shown in Table II, there are instances (18 out of 70) for which UniGen3 is able to sample 1000 solutions whereas CMSGen is not. Similarly, there are 19 instances for which CMSGen is able to sample solutions but UniGen3 is not.

Fig. 1: Cactus plot showing runtime performance of UniGen3, STS, QuickSampler and CMSGen to generate 1000 samples. Timeout: 7200s

TABLE II: Runtime performance of different samplers to generate 1000 solutions for a representative set of benchmarks. Timeout (TO): 7200s.

#### V. CASE STUDIES: FUNCTIONAL SYNTHESIS AND COMBINATORIAL TESTING

Having established that the quality of distribution generated by CMSGen is significantly better than QuickSampler, one wonders about the practical utility of CMSGen. The significant gap between runtime performance of CMSGen and UniGen3 argues for the usage of CMSGen in applications where the quality and runtime performance of samplers are key determining factors.

To this end, we focused on two such application domains: Combinatorial testing and Boolean functional synthesis. The state of the art techniques for each of these domains crucially rely on underlying uniform samplers; in fact the sampler QuickSampler was proposed in the context of combinatorial testing. For each of these case studies, we substitute the three samplers CMSGen, QuickSampler, and UniGen in the state of the art techniques, and analyse their performance on the resulting tool.

#### *A. Combinatorial Testing*

Combinatorial testing is considered a powerful paradigm for testing configurable software. The primary task of a test generator is the generation of a test suite that maximizes t-wise coverage. t-wise coverage is measured as the fraction of feature combinations appearing in the test set out of the possible valid feature combinations. Uniform sampling is considered one of the promising approaches to achieving higher t-wise coverage [31], [34], [35]. Therefore, a natural question is whether CMSGen can serve as a good test suite generator. To this end, we performed a comparative study of CMSGen vis-a-vis UniGen3, STS and QuickSampler on the set of 110 publicly available benchmarks that have been employed in prior comparative studies of sampling techniques in the context of combinatorial testing [25], [29], [35]<sup>3</sup>. It is worth emphasizing that UniGen3, STS and QuickSampler are viewed as state of the art test suite generation techniques in the presence of constraints, as witnessed by the empirical study by Plazar et al. [35].

In our comparative study of sampling techniques with respect to their efficiency in achieving higher t-wise coverage, we focus on the case of t = 2, as is standard in most empirical studies in combinatorial testing. To this end, for every benchmark, we generate 1000 samples from each of the four samplers: CMSGen, STS, QuickSampler, and UniGen3. We used a timeout of 3600 seconds for sampling. UniGen3 is, however, unable to sample for all but six benchmarks. Therefore, we exclude UniGen3 from further analysis.
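For t = 2, the coverage measure can be illustrated as follows. This is a minimal sketch that assumes a feature combination is a pair of value assignments to two distinct variables, and that obtains the set of valid combinations by brute-force enumeration of all solutions, which is only viable for tiny formulas. All function names are hypothetical.

```python
from itertools import combinations, product

def covered_pairs(assignments):
    """Set of (var1, value1, var2, value2) combinations seen in `assignments`."""
    pairs = set()
    for assignment in assignments:
        for (v1, b1), (v2, b2) in combinations(sorted(assignment.items()), 2):
            pairs.add((v1, b1, v2, b2))
    return pairs

def all_solutions(clauses, num_vars):
    """Brute-force enumeration of sol(phi); only viable for tiny formulas."""
    solutions = []
    for bits in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(bits, start=1))
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            solutions.append(assignment)
    return solutions

def pairwise_coverage(samples, solutions):
    """Fraction of valid 2-wise feature combinations hit by `samples`."""
    valid = covered_pairs(solutions)
    return len(covered_pairs(samples) & valid) / len(valid)

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3) has two witnesses.
clauses = [[1, 2], [-1, 3], [-2, -3]]
solutions = all_solutions(clauses, 3)
half = pairwise_coverage(solutions[:1], solutions)
full = pairwise_coverage(solutions, solutions)
```

In this example the two witnesses share no pair of values, so a test set containing only one of them covers 3 of the 6 valid combinations (`half == 0.5`), while a test set containing both covers all of them (`full == 1.0`).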

Fig. 2: Plot to show *2*-wise coverage% for 110 benchmarks with 1000 samples. Sampling timeout: 3600s.

<sup>3</sup>Benchmarks are available at https://zenodo.org/record/4022395


TABLE III: Analysis for 2-wise coverage with QuickSampler, STS, and CMSGen.

Figure 2 shows the experimental results with STS, QuickSampler and CMSGen. We present the number of benchmarks on the x-axis and pair-wise coverage % on the y-axis. A point (x, y) implies that x benchmarks had y% pair-wise coverage. Benchmarks are ordered in the decreasing order of coverage achieved with the samples produced by STS. Figure 2 shows that almost all the benchmarks had nearly 100% pair-wise coverage with samples generated by CMSGen; on the other hand, the average pair-wise coverage with samples from QuickSampler and STS is 51.5% and 80.15%, respectively. One should view the significant performance improvement due to CMSGen over QuickSampler in light of the fact that the primary motivation behind the proposal of QuickSampler was to achieve higher coverage.

Table III presents the analysis of 2-wise coverage with CMSGen, STS and QuickSampler for 20 representative benchmarks. In Table III, Column 2 presents the possible valid feature combinations. Columns 3, 5 and 7 present the feature combinations appearing in the test sets generated by QuickSampler, STS and CMSGen, respectively, and Columns 4, 6 and 8 give the corresponding coverage. As shown in Table III, the test set generated with CMSGen is able to cover *all* possible feature combinations for all the benchmarks.

#### *B. Boolean Functional Synthesis*

Given a formula ∃Y F(X, Y), the problem of Boolean functional synthesis seeks to compute a function ϕ such that ∃Y F(X, Y) ≡ F(X, ϕ(X)). Typically, we view F as a specification and ϕ as the function that implements the specification. Boolean functional synthesis is a fundamental problem with a wide variety of applications, including logic synthesis [28], cryptography [30], and program synthesis [42]. For example, Boolean functional synthesis encompasses program synthesis, where ϕ can be viewed as the desired program. Consequently, there has been sustained interest in the design of efficient algorithmic techniques for Boolean functional synthesis. The current state-of-the-art approach, Manthan, was proposed recently and builds on advances in sampling techniques, automated reasoning, and machine learning [21]. Manthan was demonstrated to solve 70 more benchmarks than the next best technique. In this regard, Manthan serves as a good test-bed to compare different sampling techniques.

Fig. 3: Cactus plot showing the impact of different samplers on the functional synthesis engine Manthan. Timeout: 7200s

We sought to compare CMSGen vis-a-vis UniGen3, STS and QuickSampler in their impact on the performance of Manthan. We set the timeout of 3600 seconds for the sampling phase of Manthan. To this end, we augment the sampling step of Manthan with the corresponding samplers. We perform the empirical analysis on the same 609 benchmarks<sup>4</sup> that were employed in the analysis of Manthan [21]. We present a summary of our analysis in the form of a cactus plot in Figure 3: the number of instances is shown on the x-axis and the time taken on the y-axis; a point (x, y) implies that Manthan augmented with the corresponding sampler took less than or equal to y seconds to solve x instances.

<sup>4</sup>Benchmarks are available at https://zenodo.org/record/3892859
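As an illustration of how the points of such a cactus plot are usually obtained, the following minimal sketch sorts the per-instance runtimes of one configuration and pairs each with its rank. The function name `cactus_points`, the use of `None` for unsolved instances, and the sample runtimes are assumptions made for the example; they are not the measurements behind Figure 3.

```python
def cactus_points(runtimes, timeout):
    """Return (x, y) points: the x-th fastest solved instance took y seconds."""
    # Unsolved instances are encoded as None; runs above the timeout are dropped.
    solved = sorted(t for t in runtimes if t is not None and t <= timeout)
    return list(enumerate(solved, start=1))


# Hypothetical runtimes in seconds for five instances
points = cactus_points([5.0, None, 1.0, 8000.0, 3.0], timeout=7200)
print(points)
```

A curve that extends further to the right therefore corresponds to a configuration that solves more instances within the timeout.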

Table IV shows the time taken to synthesize Boolean functions with samples generated from different samplers for a representative set of 20 benchmarks.

A few observations are in order:


TABLE IV: Runtime analysis of Manthan with QuickSampler, STS, UniGen3, and CMSGen. Timeout (TO): 7200s.


Therefore, in conclusion, Manthan augmented with CMSGen solves significantly more instances than Manthan augmented with UniGen3, STS, or QuickSampler.

### VI. CONCLUSION

Motivated by the availability of Barbarik, a tester for samplers, we sought to design a sampler for which Barbarik would return Accept. We succeeded in our task by a simple but careful tweaking of the existing state-of-the-art SAT solver, CryptoMiniSat. Our resulting sampler CMSGen is not only accepted by Barbarik but achieves better runtime performance than state-of-the-art samplers with theoretical guarantees. We then show that the resulting sampler, CMSGen, can significantly improve the performance of applications that utilize samplers. It is perhaps worth reiterating that we view the simplicity of CMSGen as its salient strength. The simplicity of CMSGen stands in stark contrast to complicated algorithmic schemes developed in the past that fail to attain the desired quality of distributions with practical runtime performance.

We now turn our attention back to Remark 1; the design of CMSGen was an iterative process with Barbarik in the loop. A natural direction of future work would be the development of a tester that provides a quantitative analysis instead of a qualitative answer of Accept or Reject to measure the quality of samplers. The significant runtime improvements in the context of functional synthesis and combinatorial testing due to CMSGen motivate us to study the impact of CMSGen in other application domains; to this end, we will release CMSGen open-source upon publication of our manuscript.

Acknowledgments: This work was supported in part by National Research Foundation Singapore under its NRF Fellowship Programme [NRF-NRFFAI1-2019-0004] and AI Singapore Programme [AISG-RP-2018-005], and NUS ODPRT Grant [R-252-000-685-13]. The computational work for this article was performed on resources of the National Supercomputing Centre, Singapore: https://www.nscc.sg

#### REFERENCES


# SAT-Inspired Eliminations for Superposition

Petar Vukmirović<sup>1</sup>, Jasmin Blanchette<sup>1,2</sup>, Marijn J.H. Heule<sup>3</sup>

<sup>1</sup>Vrije Universiteit Amsterdam, Amsterdam, the Netherlands <sup>2</sup>Université de Lorraine, CNRS, Inria, LORIA, Nancy, France <sup>3</sup>Carnegie Mellon University, Pittsburgh, Pennsylvania, United States

*Abstract*—Optimized SAT solvers not only preprocess the clause set, they also transform it during solving as inprocessing. Some preprocessing techniques have been generalized to first-order logic with equality. In this paper, we port inprocessing techniques to work with superposition, a leading first-order proof calculus, and we strengthen known preprocessing techniques. Specifically, we look into elimination of hidden literals, variables (predicates), and blocked clauses. Our evaluation using the Zipperposition prover confirms that the new techniques usefully supplement the existing superposition machinery.

### I. INTRODUCTION

Automated reasoning tools have become much more powerful in the last few decades thanks to procedures such as conflict-driven clause learning (CDCL) [1] for propositional logic and superposition [2] for first-order logic with equality. However, the effectiveness of these procedures crucially depends on how the input problem is represented as a clause set. The clause set can be optimized beforehand (*preprocessing*) or during the execution of the procedure (*inprocessing*). In this paper, we lift several preprocessing and inprocessing techniques from propositional logic to clausal first-order logic and demonstrate their usefulness in a superposition prover.

For many years, SAT solvers have used inexpensive clause simplification techniques such as hidden literal and hidden tautology elimination [3], [4] and failed literal detection [5, Sect. 1.6]. We generalize these techniques to first-order logic with equality (Sect. III). Since the generalization involves reasoning about infinite sets of literals, we propose restrictions to make them usable.

*Variable elimination*, based on Davis–Putnam resolution [6], has been studied in the context of both propositional logic [7], [8] and quantified Boolean formulas (QBFs) [9]. The basic idea is to resolve all clauses with negative occurrences of a propositional variable (i.e., a nullary predicate symbol) against clauses with positive occurrences and delete the parent clauses. Eén and Biere [10] refined the technique to identify a subset of clauses that effectively define a variable and use it to further optimize the clause set. This latter technique, *variable elimination by substitution*, has been an important preprocessor component in many SAT solvers since its introduction in 2004.

Specializing second-order quantifier elimination [11], [12], Khasidashvili and Korovin [13] adapted variable elimination to preprocess first-order problems, yielding a technique we call *singular predicate elimination*. We extend their work along two axes (Sect. IV): We generalize Eén and Biere's refinement to first-order logic, resulting in *defined predicate elimination*, and explain how both types of predicate elimination can be used during the proof search as inprocessing.

The last technique we study is *blocked clause elimination* (Sect. V). It is used in both SAT [14] and QBF solvers [15]. Its generalization to first-order logic has produced good results when used as a preprocessor, especially on satisfiable problems [16]. We explore more ways to use blocked clause elimination on satisfiable problems, including using it to establish equisatisfiability with an empty clause set or as an inprocessing rule. Unfortunately, we find that its use as inprocessing can compromise the refutational completeness of superposition.

All techniques are implemented in the Zipperposition prover (Sect. VI), allowing us to ascertain their usefulness (Sect. VII). The best configuration solves 160 additional problems on benchmarks consisting of all 13 495 first-order TPTP theorems [17]. The raw experimental data are publicly available.<sup>1</sup> More details, including all the proofs, can be found in a technical report [18].

# II. PRELIMINARIES

# *A. Clausal First-Order Logic*

Our setting is many-sorted, or many-typed, first-order logic [19] with interpreted equality and a distinguished type (or sort) $o$. Each variable $x$ is assigned a non-Boolean type, and each symbol $\mathsf{f}$ is assigned a tuple $(\tau_1, \ldots, \tau_n, \tau)$ where $n \ge 0$, the $\tau_i$ are non-Boolean types, and $\tau$ is the *result type*. We distinguish between *predicate symbols*, with $o$ as the result type, and *function symbols*. Nullary function symbols are called *constants*. Terms are either variables $x$ or well-typed applications $\mathsf{f}(t_1, \ldots, t_n)$, or $\mathsf{f}$ if $n = 0$. A term is *ground* if it contains no variables. We assume standard definitions and notations for positions, subterms, and contexts [20]. We abbreviate a vector $(a_1, \ldots, a_n)$ to $\vec{a}_n$ or $\vec{a}$, and write $\mathsf{f}^i(s)$ for the $i$-fold application of a unary symbol $\mathsf{f}$ (e.g., $\mathsf{f}^3(x) = \mathsf{f}(\mathsf{f}(\mathsf{f}(x)))$).

An atom is an equation $s \approx t$ corresponding to an unordered pair $\{s, t\}$. A literal is an equation $s \approx t$ or a disequation $s \not\approx t$. For every predicate symbol $\mathsf{p}$, $\mathsf{p}(\vec{s})$ abbreviates $\mathsf{p}(\vec{s}) \approx \top$, and $\neg\mathsf{p}(\vec{s})$ abbreviates $\mathsf{p}(\vec{s}) \not\approx \top$, where $\top$ is a distinguished constant of type $o$. We distinguish between *predicate literals* $(\neg)\mathsf{p}(\vec{s})$ and *functional literals* $s \approx t$, where $s$ and $t$ are not of type $o$. Given a literal $L$, we overload notation and write $\neg L$ to denote its complement. A clause $C$ is a multiset of literals, written as $L_1 \lor \cdots \lor L_n$ and interpreted disjunctively. Clauses are often defined as sets of literals, but superposition needs multisets; with multisets, an instance $C\sigma$ always has the same number of literals as $C$, a most convenient property. Given a clause set $N$, $N{\downarrow}_2$ denotes the subset of its binary clauses: $N{\downarrow}_2 = \{L_1 \lor L_2 \mid L_1 \lor L_2 \in N\}$.

<sup>1</sup>https://doi.org/10.5281/zenodo.4552499

#### *B. Superposition Provers*

Superposition [2] is a calculus for clausal first-order logic that extends ordered resolution [21] with equality reasoning. It is refutationally complete: Given a finite, unsatisfiable clause set, it will eventually derive the empty clause. It is parameterized by a *selection function* that influences which of a clause's literals are eligible as the target of inferences. Moreover, it is compatible with the *standard redundancy criterion*, which can be used to delete a clause *C* while preserving completeness of the calculus.

The redundancy criterion relies on an order $\succ$ that compares terms, literals, or clauses. The order is used to determine whether clauses can be deleted. If $N$ is ground, $C$ can be deleted if it is entailed by $\prec$-smaller clauses in $N$. This definition is lifted to nonground sets $N$. The criterion can be used to delete a clause that is *subsumed* by another clause (e.g., $\mathsf{p}(\mathsf{a}) \lor \mathsf{q}$ by $\mathsf{p}(x)$) or to *simplify* a clause $C$ into $C'$, which amounts to adding $C'$ and then deleting $C$ as redundant with respect to $N \cup \{C'\}$. Subsumption and simplification are the main inprocessing mechanisms available to superposition provers. Some provers also implement clause splitting [22]–[24].

Superposition provers saturate the input problem with respect to the calculus's inference rules using the *given clause procedure* [25], [26]. It partitions the proof state into a passive set *P* and an active set *A*. All clauses start in *P*. At each iteration of the procedure's main loop, the prover chooses a clause *C* from *P*, simplifies it, and moves it to *A*. Then all inferences between *C* and active clauses are performed. The resulting clauses are again simplified and put in *P*.
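The structure of the given clause loop can be sketched in a few lines. The sketch below is an illustration under strong simplifying assumptions: it uses propositional binary resolution instead of superposition, clauses are sets of DIMACS-style nonzero integers, the only simplifications are tautology deletion and duplicate deletion, and clause selection is first-in, first-out. None of the names correspond to Zipperposition's implementation.

```python
def is_tautology(clause):
    """A clause is a tautology if it contains a literal and its complement."""
    return any(-lit in clause for lit in clause)


def resolvents(c, d):
    """All binary resolvents of the propositional clauses c and d."""
    return [frozenset((c - {lit}) | (d - {-lit})) for lit in c if -lit in d]


def given_clause(clauses, max_iterations=10_000):
    passive = [frozenset(c) for c in clauses]   # all clauses start in P
    active = []                                 # A is initially empty
    for _ in range(max_iterations):
        if not passive:
            return "saturated"
        given = passive.pop(0)                  # choose a clause C from P
        if is_tautology(given) or given in active:
            continue                            # simplification: delete C
        if not given:
            return "unsat"                      # the empty clause was derived
        active.append(given)                    # move C to A
        for other in active:                    # inferences between C and A
            passive.extend(resolvents(given, other))
    return "unknown"


print(given_clause([[1], [-1, 2], [-2]]))   # a, (not a or b), not b
print(given_clause([[1, 2]]))               # a or b
```

A real prover replaces the first-in, first-out choice by clause selection heuristics and applies far stronger simplifications, but the division of the proof state into a passive and an active set is the same.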

#### III. HIDDEN-LITERAL-BASED ELIMINATION

In propositional logic, binary clauses from a clause set $N$ can be used to efficiently discover literals $L, L'$ for which the implication $L' \longrightarrow L$ is entailed by $N$'s binary clauses, i.e., $N{\downarrow}_2 \models L' \longrightarrow L$. Heule et al. [4] introduced the concept of *hidden literals* to capture such implications.

*Definition 1:* Given a propositional literal $L$ and a propositional clause set $N$, the set of *propositional hidden literals* for $L$ and $N$ is $\mathrm{HL}_{\mathrm{p}}(L, N) = \{L' \mid L' \hookrightarrow_{\mathrm{p}}^{*} L\} \setminus \{L\}$, where $\hookrightarrow_{\mathrm{p}}$ is defined such that $\neg L_1 \hookrightarrow_{\mathrm{p}} L_2$ whenever $L_1 \lor L_2 \in N$. Moreover, $\mathrm{HL}_{\mathrm{p}}(L_1 \lor \cdots \lor L_n, N) = \bigcup_{i=1}^{n} \mathrm{HL}_{\mathrm{p}}(L_i, N)$.

Heule et al. used a fixpoint computation, but our definition based on the reflexive transitive closure is equivalent. Intuitively, a hidden literal can be added to or removed from a clause without affecting its semantics in models of $N$. By eliminating hidden literals from $C$, we simplify it. By adding hidden literals to $C$, we might get a tautology $C'$ (i.e., a valid clause: $\models C'$), meaning that $N{\downarrow}_2 \models C$, thereby enabling us to delete $C$. Note that $\mathrm{HL}_{\mathrm{p}}(L, N)$ is finite for a finite $N$.

*Definition 2:* Given $L' \lor L \lor C \in N$, if $L' \in \mathrm{HL}_{\mathrm{p}}(L, N)$, *hidden literal elimination* (HLE) replaces $N$ by $(N \setminus \{L' \lor L \lor C\}) \cup \{L \lor C\}$. Given $C \in N$, $\{L_1, \ldots, L_n\} = \mathrm{HL}_{\mathrm{p}}(C, N)$, and $C' = C \lor L_1 \lor \cdots \lor L_n$, if $C'$ is a tautology, *hidden tautology elimination* (HTE) replaces $N$ by $N \setminus \{C\}$.

*Theorem 3:* The result of applying HLE or HTE to a clause set *N* is equivalent to *N*.

*Proof:* For HLE, if $L' \in \mathrm{HL}_{\mathrm{p}}(L, N)$, then $N{\downarrow}_2 \models \neg L' \lor L$. Then, subsumption resolution yields the shortened clause $L \lor C'$ from Definition 2. For HTE, it can be shown that $N' \models C$ if and only if $N' \models C \lor L'$, where $L' \in \mathrm{HL}_{\mathrm{p}}(C, N)$. By transitivity of equivalence, we get the desired result.
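The propositional definitions above can be made concrete with a minimal sketch. It assumes literals are encoded as DIMACS-style nonzero integers (so `-x` is the complement of `x`) and clauses as tuples; the function names are hypothetical. The set $\mathrm{HL}_{\mathrm{p}}(L, N)$ is computed by a backward breadth-first search over the implication graph induced by the binary clauses of $N$.

```python
from collections import defaultdict, deque


def hidden_literals(lit, clauses):
    """HL_p(lit, N): all literals L' with L' ->* lit via binary clauses, minus lit."""
    preds = defaultdict(set)
    for clause in clauses:
        if len(clause) == 2:
            a, b = clause
            preds[b].add(-a)   # a or b gives the edge (not a) -> b
            preds[a].add(-b)   # and, symmetrically, (not b) -> a
    seen, queue = {lit}, deque([lit])
    while queue:
        current = queue.popleft()
        for p in preds[current]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen - {lit}


def hle(clause, clauses):
    """Remove literals that are hidden for another literal of the clause, one at a time."""
    kept = list(clause)
    changed = True
    while changed:
        changed = False
        for cand in kept:
            if any(cand in hidden_literals(l, clauses) for l in kept if l != cand):
                kept.remove(cand)
                changed = True
                break
    return tuple(kept)


def is_hidden_tautology(clause, clauses):
    """HTE test: extend the clause by its hidden literals and check for a tautology."""
    extended = set(clause)
    for lit in clause:
        extended |= hidden_literals(lit, clauses)
    return any(-lit in extended for lit in extended)


# a=1, b=2, c=3, d=4
print(hidden_literals(2, [(-1, 2)]))                         # a is hidden for b
print(hle((1, 2, 3), [(-1, 2), (1, 2, 3)]))                  # a or b or c becomes b or c
print(is_hidden_tautology((-1, 3, 4), [(-1, 2), (-2, 3)]))   # not a or c or d
```

Literals are removed one at a time in `hle` so that two literals that are hidden for each other are never both deleted.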

We generalize hidden literals to first-order logic with equality by considering substitutivity of variables as well as congruence of equality.

*Definition 4:* Given a literal $L$ and a clause set $N$, the set of *hidden literals* for $L$ and $N$ is $\mathrm{HL}(L, N) = \{L' \mid L' \hookrightarrow^{*} L\} \setminus \{L\}$, where $\hookrightarrow$ is defined so that (1) $\neg L'\sigma \hookrightarrow L\sigma$ if $L' \lor L \in N$ and $\sigma$ is a substitution; (2) $s \approx t \hookrightarrow u[s] \approx u[t]$ for all terms $s, t$ and contexts $u[\,]$; and (3) $u[s] \not\approx u[t] \hookrightarrow s \not\approx t$ for all terms $s, t$ and contexts $u[\,]$. Moreover, $\mathrm{HL}(L_1 \lor \cdots \lor L_n, N) = \bigcup_{i=1}^{n} \mathrm{HL}(L_i, N)$.

The generalized definition also enjoys the key property that $L' \in \mathrm{HL}(L, N)$ implies $N{\downarrow}_2 \models L' \longrightarrow L$. However, $\mathrm{HL}(L, N)$ may be infinite even for predicate literals; for example, $\mathsf{p}(\mathsf{f}^i(x)) \in \mathrm{HL}(\mathsf{p}(x), \{\mathsf{p}(x) \lor \neg\mathsf{p}(\mathsf{f}(x))\})$ for every $i$.

Based on Definition 4, we can generalize hidden literal elimination and support a related technique:

$$\frac{L' \lor L \lor C}{L \lor C} \text{HLE} \quad \text{if } L' \in \text{HL}(L, N)$$

$$\frac{L \lor C}{C} \text{FLE} \quad \text{if } L', \neg L' \in \text{HL}(\neg L, N)$$

Double lines denote *simplification rules*: When the premises appear in the clause set, the prover can use the redundancy criterion to replace them by the conclusions. The second rule is called *failed literal elimination*, inspired by the SAT technique of asserting $\neg L$ if $L$ is a *failed literal* [5]. It is easy to see that rule HLE is sound. From $L' \in \mathrm{HL}(L, N)$ we have $N \models L' \longrightarrow L$ (i.e., $\neg L' \lor L$). Performing subsumption resolution [21] between $L' \lor L \lor C$ and $\neg L' \lor L$ yields the conclusion, which is therefore entailed by $N$. For FLE, the condition $L', \neg L' \in \mathrm{HL}(\neg L, N)$ means that $N{\downarrow}_2 \models \{\neg L' \lor \neg L,\ L' \lor \neg L\} \models \neg L$.

*Example 5:* Consider the clause set $N = \{\mathsf{p}(x) \lor \neg\mathsf{p}(\mathsf{f}(x)),\ \mathsf{p}(\mathsf{f}(\mathsf{f}(x))) \lor \mathsf{a} \approx \mathsf{b}\}$ and the clause $C = \mathsf{f}(\mathsf{a}) \not\approx \mathsf{f}(\mathsf{b}) \lor \mathsf{p}(x)$. The first clause in $N$ induces $\mathsf{p}(\mathsf{f}(x)) \hookrightarrow \mathsf{p}(x)$, $\mathsf{p}(\mathsf{f}(\mathsf{f}(x))) \hookrightarrow \mathsf{p}(\mathsf{f}(x))$, and hence $\mathsf{p}(\mathsf{f}(\mathsf{f}(x))) \hookrightarrow^{*} \mathsf{p}(x)$. Together with the second clause in $N$, it can be used to derive $\mathsf{a} \not\approx \mathsf{b} \hookrightarrow^{*} \mathsf{p}(x)$. Finally, using rule (3) of Definition 4, we derive $\mathsf{f}(\mathsf{a}) \not\approx \mathsf{f}(\mathsf{b}) \hookrightarrow^{*} \mathsf{p}(x)$, that is, $\mathsf{f}(\mathsf{a}) \not\approx \mathsf{f}(\mathsf{b}) \in \mathrm{HL}(\mathsf{p}(x), N)$. This allows us to remove $C$'s first literal using HLE.

Two special cases of HLE exploit equality congruence as embodied by conditions (2) and (3) of Definition 4 without requiring the computation of the HL set:

$$\frac{s \approx t \lor u[s] \approx u[t] \lor C}{u[s] \approx u[t] \lor C} \text{CONGHLE}^+$$

$$\frac{s \not\approx t \lor u[s] \not\approx u[t] \lor C}{s \not\approx t \lor C} \text{CONGHLE}^-$$

Hidden literals can be combined with unit clauses $L'$ to remove more literals:

$$\frac{L' \quad L \lor C}{L' \quad C} \text{UNITHLE} \quad \text{if } L'\sigma \in \text{HL}(\neg L, N)$$

Given a unit clause $L' \in N$, the rule uses it to discharge $L'\sigma$ in $N \models L'\sigma \longrightarrow \neg L$. As a result, we have $N \models \neg L$, making it possible to remove $L$ from $L \lor C$.

*Example 6:* Consider the clause set $N = \{\mathsf{p}(x) \lor \mathsf{q}(\mathsf{f}(x)),\ \neg\mathsf{q}(\mathsf{f}(\mathsf{a})) \lor \mathsf{f}(\mathsf{b}) \approx \mathsf{g}(\mathsf{c}),\ \mathsf{f}(x) \not\approx \mathsf{g}(y)\}$ and the clause $C = \neg\mathsf{p}(\mathsf{a}) \lor \neg\mathsf{q}(\mathsf{b})$. The first clause in $N$ induces $\neg\mathsf{q}(\mathsf{f}(\mathsf{a})) \hookrightarrow \mathsf{p}(\mathsf{a})$, whereas the second one induces $\mathsf{f}(\mathsf{b}) \not\approx \mathsf{g}(\mathsf{c}) \hookrightarrow \neg\mathsf{q}(\mathsf{f}(\mathsf{a}))$. Thus, we have $\mathsf{f}(\mathsf{b}) \not\approx \mathsf{g}(\mathsf{c}) \hookrightarrow^{*} \mathsf{p}(\mathsf{a})$, that is, $\mathsf{f}(\mathsf{b}) \not\approx \mathsf{g}(\mathsf{c}) \in \mathrm{HL}(\mathsf{p}(\mathsf{a}), N)$. By applying the substitution $\{x \mapsto \mathsf{b},\ y \mapsto \mathsf{c}\}$ to the third clause in $N$, we can fulfill the conditions of UNITHLE and remove $C$'s first literal.

Next, we generalize hidden tautologies to first-order logic.

*Definition 7:* A clause $C$ is a *hidden tautology* for a clause set $N$ if there exists a finite set $\{L_1, \ldots, L_n\} \subseteq \mathrm{HL}(C, N)$ such that $C \lor L_1 \lor \cdots \lor L_n$ is a tautology.

*Example 8:* In general, hidden tautologies are not redundant and cannot be deleted during saturation. Consider the unsatisfiable set $N = \{\neg\mathsf{a},\ \neg\mathsf{b},\ \mathsf{a} \lor \mathsf{c},\ \mathsf{b} \lor \neg\mathsf{c}\}$, the order $\mathsf{a} \prec \mathsf{b} \prec \mathsf{c}$, and the empty selection function. The only possible superposition inference from $N$ is between the last two clauses, yielding the hidden tautology $\mathsf{a} \lor \mathsf{b}$ (after simplifying away $\top \not\approx \top$), which is entailed by the larger clauses $\mathsf{a} \lor \mathsf{c}$ and $\mathsf{b} \lor \neg\mathsf{c}$. If this clause is removed, the prover could enter an infinite loop, forever generating and deleting the hidden tautology.

To delete hidden tautologies during saturation, the prover could check that all the relevant clause instances encountered along the computation of HL are ≺-smaller than a given hidden tautology. However, this would be expensive and seldom succeed, given that superposition creates lots of nonredundant hidden tautologies. Instead, we propose to simplify hidden tautologies using the following rules:

$$\frac{L \lor L' \lor C}{L \lor L'} \text{HTR} \quad \text{if } \neg L' \in \text{HL}(L, N) \text{ and } C \neq \bot$$

$$\frac{L \lor C}{L} \text{FLR} \quad \text{if } L', \neg L' \in \text{HL}(L, N) \text{ and } C \neq \bot$$

We call these techniques *hidden tautology reduction* and *failed literal reduction*, respectively. Both rules are sound. As with hidden literals, unit clauses *L* ′ can be exploited:

$$\frac{L' \quad L \lor C}{L' \quad L} \text{UNITHTR} \quad \text{if } L'\sigma \in \text{HL}(L, N) \text{ and } C \neq \bot$$

We give the simplification rules above the collective name of *hidden-literal-based elimination* (HLBE). Yet another use of hidden literals is for *equivalent literal substitution* [3]: If both $L' \in \mathrm{HL}(L, N)$ and $L \in \mathrm{HL}(L', N)$, we can often simplify $L'\sigma$ to $L\sigma$ in $N$ if $L'\sigma \succ L\sigma$. We want to investigate this further.

*Theorem 9:* The rules HLE, FLE, CONGHLE<sup>+</sup>, CONGHLE<sup>−</sup>, UNITHLE, HTR, FLR, and UNITHTR are sound simplification rules.

#### IV. PREDICATE ELIMINATION

For propositional logic, variable elimination [10] is one of the main preprocessing and inprocessing techniques. Following Gabbay and Ohlbach's ideas [11], Khasidashvili and Korovin [13] generalized variable elimination to first-order logic with equality and demonstrated that it is effective as a preprocessor. We propose an improvement that makes this applicable in more cases and show that, with a minor restriction, it can be integrated in a superposition prover without compromising its refutational completeness.

#### *A. Singular Predicates*

Khasidashvili and Korovin's preprocessing technique removes singular predicates (which they call "non-self-referential predicates") from the problem using so-called flat resolution.

*Definition 10:* A predicate symbol is called *singular* (or "non-self-referential") for a clause set $N$ if it occurs at most once in every clause contained in $N$.

*Definition 11:* Let $C = \mathsf{p}(\vec{s}_n) \lor C'$ and $D = \neg\mathsf{p}(\vec{t}_n) \lor D'$ be clauses with no variables in common. The clause $s_1 \not\approx t_1 \lor \cdots \lor s_n \not\approx t_n \lor C' \lor D'$ is a *flat resolvent* of $C$ and $D$ on $\mathsf{p}$.
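To make Definition 11 concrete, here is a minimal sketch that builds a flat resolvent. The representation is an assumption made for the example: a literal is a triple `(positive, symbol, args)`, equality uses the symbol `"="` with two arguments, a clause is a list of literals, and terms are plain strings. The sketch also assumes that the two clauses have already been renamed apart and that the predicate occurs with the expected polarity in each clause.

```python
def flat_resolvent(c, d, pred):
    """Flat resolvent of c (containing pred positively) and d (containing it negatively)."""
    i = next(k for k, (pos, sym, _) in enumerate(c) if pos and sym == pred)
    j = next(k for k, (pos, sym, _) in enumerate(d) if not pos and sym == pred)
    s_args, t_args = c[i][2], d[j][2]
    # One disequation s_i != t_i per argument position, instead of unification
    diseqs = [(False, "=", (s, t)) for s, t in zip(s_args, t_args)]
    return diseqs + c[:i] + c[i + 1:] + d[:j] + d[j + 1:]


# C = p(a, X) or q(X)      D = not p(Y, b) or r(Y)
C = [(True, "p", ("a", "X")), (True, "q", ("X",))]
D = [(False, "p", ("Y", "b")), (True, "r", ("Y",))]
print(flat_resolvent(C, D, "p"))
```

For the two clauses above the result represents $\mathsf{a} \not\approx y \lor x \not\approx \mathsf{b} \lor \mathsf{q}(x) \lor \mathsf{r}(y)$: unlike ordinary resolution, no unifier is computed, and the argument constraints are kept as disequations.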

Given two (possibly identical) clause sets $M, N$, predicate elimination iteratively replaces clauses from $N$ containing the symbol $\mathsf{p}$ with all flat resolvents against clauses in $M$. Eventually, it yields a set with no occurrences of $\mathsf{p}$.

*Definition 12:* Let $M, N$ be clause sets and $\mathsf{p}$ be a singular predicate for $M$. Let $\rightsquigarrow$ be the following relation on clause set pairs and clause sets:


The *resolved set* $M \rtimes_{\mathsf{p}} N$ is the clause set $N'$ such that $(M, N) \rightsquigarrow^{*} N'$.

The relation $\rightsquigarrow$ is confluent up to variable renaming. Thanks to the singularity constraint on $M$, it also terminates on finite sets because the following ordinal measure decreases: $\nu(\{D_1, \ldots, D_n\}) = \omega^{\nu(D_1)} \oplus \cdots \oplus \omega^{\nu(D_n)}$, where $\nu(D)$ counts the occurrences of $\mathsf{p}$ in $D$, $\omega$ is the first infinite ordinal, and $\oplus$ is the Hessenberg, or natural, sum, which is commutative. For every transition $(M, \{C\} \cup N) \rightsquigarrow (M, N' \cup N)$, we have $\nu(\{C\}) = \omega^{\nu(C)} > \omega^{\nu(C)-1} \cdot |N'| = \nu(N')$.

Next, it is useful to partition clause sets into subsets based on the presence and polarity of a singular predicate.

*Definition 13:* Let $N$ be a clause set and $\mathsf{p}$ be a singular predicate for $N$. Let $N^{+}_{\mathsf{p}}$ consist of all clauses of the form $\mathsf{p}(\vec{s}) \lor C' \in N$, let $N^{-}_{\mathsf{p}}$ consist of all clauses of the form $\neg\mathsf{p}(\vec{s}) \lor C' \in N$, let $N_{\mathsf{p}} = N^{+}_{\mathsf{p}} \cup N^{-}_{\mathsf{p}}$, and let $\overline{N}_{\mathsf{p}} = N \setminus N_{\mathsf{p}}$.

*Definition 14:* Let $N$ be a clause set and $\mathsf{p}$ be a singular predicate for $N$. *Singular predicate elimination* (SPE) of $\mathsf{p}$ in $N$ replaces $N$ by $\overline{N}_{\mathsf{p}} \cup (N^{+}_{\mathsf{p}} \rtimes_{\mathsf{p}} N^{-}_{\mathsf{p}})$.

The result of SPE is satisfiable if and only if $N$ is satisfiable [13, Theorem 1], justifying SPE's use in a preprocessor. However, eliminating singular predicates aggressively can dramatically increase the number of clauses. To prevent this, Khasidashvili and Korovin suggested replacing $N$ by $N'$ only if $\lambda(N') \le \lambda(N)$ and $\mu(N') \le \mu(N)$, where $\lambda(N)$ is the number of literals in $N$ and $\mu(N)$ is the sum, over all clauses $C \in N$, of the square of the number of distinct variables in $C$.

Compared with what modern SAT solvers use, this criterion is fairly restrictive. We relax it to make it possible to eliminate more predicates, within reason. Let $K_{\mathrm{tol}} \in \mathbb{N}$ be a tolerance parameter. A predicate elimination step from $N$ to $N'$ is allowed if $\lambda(N') < \lambda(N) + K_{\mathrm{tol}}$ or $\mu(N') < \mu(N)$ or $|N'| < |N| + K_{\mathrm{tol}}$.
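The two acceptance criteria can be compared on a small example. The sketch below is illustrative only: it reuses a hypothetical clause representation in which a literal is a triple `(positive, symbol, args)`, terms are flat strings, and variables are recognized by an uppercase first letter (as in TPTP syntax). The function names are not taken from any prover.

```python
def num_literals(clauses):
    """lambda(N): total number of literals."""
    return sum(len(clause) for clause in clauses)


def distinct_variables(clause):
    return {a for (_, _, args) in clause for a in args if a[:1].isupper()}


def mu(clauses):
    """mu(N): sum over all clauses of the squared number of distinct variables."""
    return sum(len(distinct_variables(clause)) ** 2 for clause in clauses)


def strict_allowed(old, new):
    """Criterion attributed above to Khasidashvili and Korovin."""
    return num_literals(new) <= num_literals(old) and mu(new) <= mu(old)


def relaxed_allowed(old, new, k_tol):
    """Relaxed criterion with tolerance parameter k_tol."""
    return (num_literals(new) < num_literals(old) + k_tol
            or mu(new) < mu(old)
            or len(new) < len(old) + k_tol)


# Eliminating p from {p(X) or q(X), not p(Y) or r(Y)} by flat resolution
old = [[(True, "p", ("X",)), (True, "q", ("X",))],
       [(False, "p", ("Y",)), (True, "r", ("Y",))]]
new = [[(False, "=", ("X", "Y")), (True, "q", ("X",)), (True, "r", ("Y",))]]
print(strict_allowed(old, new), relaxed_allowed(old, new, k_tol=0))
```

In this example the single flat resolvent has fewer literals but two distinct variables, so $\mu$ grows from 2 to 4: the strict criterion rejects the step, whereas the relaxed criterion accepts it even with $K_{\mathrm{tol}} = 0$.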

#### *B. Defined Predicates*

SPE is effective, but an important refinement has not yet been adapted to first-order logic: variable elimination by substitution. Eén and Biere [10] discovered that a propositional variable $\mathsf{x}$ can be eliminated without computing all resolvents if it is expressible as an equivalence $\mathsf{x} \longleftrightarrow \varphi$, where $\varphi$, the "gate," is an arbitrary formula that does not reference $\mathsf{x}$. They partition a set $N$ into a definition set $G$, essentially the clausification of $\mathsf{x} \longleftrightarrow \varphi$, and $R = N_{\mathsf{p}} \setminus G$, the remaining clauses containing $\mathsf{p}$. To eliminate $\mathsf{x}$ from $N$ while preserving satisfiability, it suffices to resolve clauses from $G$ against clauses from $R$, effectively substituting $\varphi$ for $\mathsf{x}$ in $R$. Crucially, we do not need to resolve pairs of clauses from $G$ or pairs of clauses from $R$. We generalize this idea to first-order logic.

*Definition 15:* Let $G$ be a clause set, $\mathsf{p}$ be a predicate symbol, and $\vec{x}$ be distinct variables. The set $G$ is a *definition set* for $\mathsf{p}$ if (1) $\mathsf{p}$ is singular for $G$, (2) $G$ consists of clauses of the form $(\neg)\mathsf{p}(\vec{x}) \lor C'$ (up to variable renaming), (3) the variables in $C'$ are all among $\vec{x}$, (4) all clauses in $G^{+}_{\mathsf{p}} \rtimes_{\mathsf{p}} G^{-}_{\mathsf{p}}$ are tautologies, and (5) $E(\vec{\mathsf{c}})$ is unsatisfiable, where the *environment* $E(\vec{x})$ consists of all subclauses $C'$ of any $(\neg)\mathsf{p}(\vec{x}) \lor C' \in G$ and $\vec{\mathsf{c}}$ is a tuple of distinct fresh constants substituted in for $\vec{x}$.

A definition set $G$ corresponds intuitively to a definition by cases in mathematics, for example:

$$\mathsf{p}(\vec{x}) = \begin{cases} \top & \text{if } \varphi(\vec{x}) \\ \bot & \text{if } \psi(\vec{x}) \end{cases}$$

Part (4) states that the case conditions are mutually exclusive (e.g., $\neg\varphi(\vec{x}) \lor \neg\psi(\vec{x})$), and part (5) states that they are exhaustive (e.g., $\nexists \vec{\mathsf{c}}.\ \neg\varphi(\vec{\mathsf{c}}) \land \neg\psi(\vec{\mathsf{c}})$). Given a quantifier-free formula $\mathsf{p}(\vec{x}) \longleftrightarrow \varphi(\vec{x})$ with distinct variables $\vec{x}$ such that $\varphi(\vec{x})$ does not contain $\mathsf{p}$, any reasonable clausification algorithm would produce a definition set for $\mathsf{p}$.

*Example 16:* Given the formula $\mathsf{p}(x) \longleftrightarrow \mathsf{q}(x) \land (\mathsf{r}(x) \lor \mathsf{s}(x))$, a standard clausification algorithm [27] produces $\{\neg\mathsf{p}(x) \lor \mathsf{q}(x),\ \neg\mathsf{p}(x) \lor \mathsf{r}(x) \lor \mathsf{s}(x),\ \mathsf{p}(x) \lor \neg\mathsf{q}(x) \lor \neg\mathsf{r}(x),\ \mathsf{p}(x) \lor \neg\mathsf{q}(x) \lor \neg\mathsf{s}(x)\}$, which qualifies as a definition set for $\mathsf{p}$.
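Conditions (4) and (5) of Definition 15 can be checked by hand for Example 16, or with the minimal sketch below. It makes a simplifying assumption: the variable $x$ is replaced by one fresh constant, so each atom is treated as a propositional variable and the disequations introduced by flat resolution are ignored. A first-order check of condition (4) would need congruence closure. Conditions (1) to (3) are not checked, and all names are hypothetical.

```python
from itertools import product


def is_tautology(clause):
    return any((atom, not sign) in clause for atom, sign in clause)


def satisfiable(clauses):
    """Brute-force propositional satisfiability check over all assignments."""
    atoms = sorted({atom for clause in clauses for atom, _ in clause})
    for values in product([False, True], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        if all(any(model[a] == s for a, s in clause) for clause in clauses):
            return True
    return False


def check_definition_set(g, p):
    """Return (condition 4 holds, condition 5 holds) for the ground clause set g."""
    pos = [[l for l in c if l[0] != p] for c in g if (p, True) in c]
    neg = [[l for l in c if l[0] != p] for c in g if (p, False) in c]
    exclusive = all(is_tautology(c + d) for c in pos for d in neg)   # (4)
    exhaustive = not satisfiable(pos + neg)                          # (5)
    return exclusive, exhaustive


# Clauses of Example 16 with x instantiated by a fresh constant
G = [
    [("p", False), ("q", True)],
    [("p", False), ("r", True), ("s", True)],
    [("p", True), ("q", False), ("r", False)],
    [("p", True), ("q", False), ("s", False)],
]
print(check_definition_set(G, "p"))
print(check_definition_set(G[:3], "p"))   # last clause dropped
```

Dropping the last clause leaves the case conditions mutually exclusive but no longer exhaustive, so the reduced set fails condition (5).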

Definition sets generalize Eén and Biere's gates. They can be recognized syntactically for formulas such as $\mathsf{p}(\vec{x}) \longleftrightarrow \bigvee_i \mathsf{q}_i(\vec{s}_i)$ or $\mathsf{p}(\vec{x}) \longleftrightarrow \bigwedge_i \mathsf{q}_i(\vec{s}_i)$, or semantically: Condition (4) can be checked using the congruence closure algorithm, and condition (5) amounts to a propositional unsatisfiability check.

The key result about propositional gates carries over to definition sets.

*Definition 17:* Let $N$ be a clause set, $\mathsf{p}$ be a predicate symbol, $G \subseteq N$ be a definition set for $\mathsf{p}$, and $R = N_{\mathsf{p}} \setminus G$. *Defined predicate elimination* (DPE) of $\mathsf{p}$ in $N$ replaces $N$ by $\overline{N}_{\mathsf{p}} \cup (G_{\mathsf{p}} \rtimes_{\mathsf{p}} R_{\mathsf{p}})$.

*Theorem 18:* The result of applying DPE to a clause set *N* is satisfiable if and only if *N* is satisfiable.

Since there will typically be only a few defined predicates in the problem, it makes sense to fall back on SPE when no definition is found.

*Definition 19:* Let $N$ be a clause set and $\mathsf{p}$ be a predicate symbol. If there exists a definition set $G \subseteq N$ for $\mathsf{p}$, *portfolio predicate elimination* (PPE) on $\mathsf{p}$ in $N$ replaces $N$ with $\overline{N}_{\mathsf{p}} \cup (G_{\mathsf{p}} \rtimes_{\mathsf{p}} R_{\mathsf{p}})$, where $R = N_{\mathsf{p}} \setminus G$. Otherwise, if $\mathsf{p}$ is singular in $N$, it results in $\overline{N}_{\mathsf{p}} \cup (N^{+}_{\mathsf{p}} \rtimes_{\mathsf{p}} N^{-}_{\mathsf{p}})$. In all other cases, it is not applicable.

#### *C. Refutational Completeness*

Hidden-literal-based techniques fit within the traditional framework of saturation, because they delete or reduce a clause based on the *presence* of other clauses. In contrast, predicate elimination relies on the *absence* of clauses from the proof state. We can still integrate it with superposition as follows: At every *k*th iteration of the given clause procedure, perform predicate elimination on *A* ∪ *P*, and add all new clauses to *P*.

One may wonder whether such an approach preserves the refutational completeness of the calculus. The answer is no. To see why, consider the following *binary splitting* rule based on Riazanov and Voronkov [22]:

$$\frac{C \lor D}{\mathbf{p} \lor C \quad D \lor \neg \mathbf{p}} \text{ BS}$$

Provisos: *C* and *D* have no free variables in common, p is fresh, and p is ≺-smaller than *C* and *D*. Since the conclusions are smaller than the premise, the rule can be applied aggressively as a simplification. But notice that the effect of splitting can be undone by singular predicate elimination, possibly giving rise to loops BS, SPE, BS, SPE, .... This breaks completeness.

Our solution is to curtail the entailment relation used by the redundancy criterion to disallow splitting-like simplifications. Weak entailment $\models\flat$ is defined via an ad hoc nonclassical logic so that $\{\mathsf{p} \lor C,\ \neg\mathsf{p} \lor C\} \not\models\flat\ \{C\}$ and yet $\models\flat\ \{\mathsf{p} \lor \neg\mathsf{p}\}$. More precisely, this logic is defined via an encoding: $M \models\flat\ N$ if and only if $M^{\flat} \models N^{\flat}$, where $\mathsf{p}(\vec{t})^{\flat} = \mathsf{p}(\vec{t}) \not\approx \bot$, $(\neg\mathsf{p}(\vec{t}))^{\flat} = \mathsf{p}(\vec{t}) \not\approx \top$, and $L^{\flat} = L$ otherwise. Moreover, the type $o$ may be interpreted as any set of cardinality at least 2, and $\bot$ must be a distinguished symbol interpreted differently from $\top$.
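The two stated properties of weak entailment can be checked by brute force in a propositional toy setting. The sketch below is an illustration only, under several assumptions: *C* is taken to be a single nullary predicate `c`, the type *o* is interpreted as one fixed finite set passed in as `domain`, and all names (`weakly_entails`, `TOP`, `BOT`) are hypothetical. A single domain suffices to exhibit a countermodel, but it does not prove entailment over all interpretations.

```python
from itertools import product

TOP, BOT = "top", "bot"  # distinguished, distinct elements of type o

def literal_holds(positive, value):
    """Encoded literal: p becomes p != BOT, and not-p becomes p != TOP."""
    return value != BOT if positive else value != TOP

def clause_holds(clause, valuation):
    return any(literal_holds(pos, valuation[atom]) for atom, pos in clause)

def weakly_entails(premises, conclusion, domain):
    """Check the encoded entailment over one fixed interpretation of type o."""
    atoms = sorted({atom for cl in premises + conclusion for atom, _ in cl})
    for values in product(domain, repeat=len(atoms)):
        valuation = dict(zip(atoms, values))
        if all(clause_holds(cl, valuation) for cl in premises) and \
                not all(clause_holds(cl, valuation) for cl in conclusion):
            return False  # found a countermodel
    return True

split_premises = [[("p", True), ("c", True)], [("p", False), ("c", True)]]
conclusion = [[("c", True)]]
excluded_middle = [[("p", True), ("p", False)]]

two_values = [TOP, BOT]             # behaves classically
three_values = [TOP, BOT, "other"]  # a third value satisfies both p and not-p

print(weakly_entails(split_premises, conclusion, two_values))
print(weakly_entails(split_premises, conclusion, three_values))
print(weakly_entails([], excluded_middle, three_values))
```

With a third truth value, both `p` and `not p` can hold at once, which is what blocks the splitting-like step from {p ∨ *C*, ¬p ∨ *C*} to {*C*}.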

The standard redundancy criterion *Red*♭ based on |=♭ supports all the familiar deletion and simplification techniques except splitting. Using *Red*♭ not only prevents looping, but it also enables the use of the given clause procedure, because any redundant inference according to *Red*♭ remains redundant after SPE or DPE. As usual, the devil is in the details, and the details are in the report [18].

#### V. SATISFIABILITY BY CLAUSE ELIMINATION

The main approaches to show satisfiability of a first-order problem are to produce either a finite Herbrand model or a saturated clause set. Saturations rarely occur except for very small problems or within decidable fragments. In this section, we explore an alternative approach that establishes satisfiability by iteratively removing clauses while preserving unsatisfiability, until the clause set has been transformed into the empty set. So far, this technique has been studied only for QBF [28]. We show that *blocked clause elimination* (BCE) can be used for this purpose. It can efficiently solve some problems for which the saturated set would be infinite. However, it can break the refutational completeness of a saturation prover. We conclude with a procedure that transforms a finite Herbrand model into a sequence of clause elimination steps ending in the empty clause set, thereby demonstrating the theoretical power of clause elimination.

Kiesl et al. [16] generalized blocked clause elimination to first-order logic. Their generalization uses flat *L*-resolvents, an extension of flat resolvents that resolves a single literal *L* against *m* literals of the other clause.

*Definition 20:* Let $C = L \lor C'$ and $D = L_1 \lor \cdots \lor L_m \lor D'$, where (1) $m \ge 1$, (2) the literals $L_i$ are of opposite polarity to $L$, (3) $L$'s atom is $\mathsf{p}(\vec{s}_n)$, (4) $L_i$'s atom is $\mathsf{p}(\vec{t}_i)$ for each $i$, and (5) $C$ and $D$ have no variables in common. The clause $\bigl(\bigvee_{i=1}^{m} \bigvee_{j=1}^{n} s_j \not\approx t_{ij}\bigr) \lor C' \lor D'$ is a *flat $L$-resolvent* of $C$ and $D$.
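As a small worked instance of this definition (a sketch; the two clauses are borrowed from Example 22), take $C = \mathsf{p}(x, x)$ with $L = \mathsf{p}(x, x)$ and empty $C'$, and $D = \neg\mathsf{p}(y_1, y_3) \lor \mathsf{p}(y_1, y_2) \lor \mathsf{p}(y_2, y_3)$ with $m = 1$ and $L_1 = \neg\mathsf{p}(y_1, y_3)$. Here $n = 2$, $\vec{s}_2 = (x, x)$, and $\vec{t}_1 = (y_1, y_3)$, so the flat $L$-resolvent is

$$x \not\approx y_1 \;\lor\; x \not\approx y_3 \;\lor\; \mathsf{p}(y_1, y_2) \;\lor\; \mathsf{p}(y_2, y_3).$$

No unifier is computed: the argument-wise disequations take the place of unification.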

*Definition 21:* A clause *C* = *L* ∨ *C*′ is (*equality-*)*blocked* by *L* in a clause set *N* if all flat *L*-resolvents between *C* and clauses in *N* \ {*C*} are tautologies.

Removing a blocked clause from a set preserves unsatisfiability [16]. Kiesl et al. evaluated the effect of removing all blocked clauses as a preprocessing step and found that it increases the prover's success rate.

In fact, there exist satisfiable problems that cannot be saturated in finitely many steps regardless of the calculus's parameters but that can be reduced to an empty, vacuously satisfiable problem through blocked clause elimination.

*Example 22:* Consider the clause set $N$ consisting of $C = \mathsf{p}(x, x)$ and $D = \neg\mathsf{p}(y_1, y_3) \lor \mathsf{p}(y_1, y_2) \lor \mathsf{p}(y_2, y_3)$. Note that if no literal is selected, all literals are eligible for superposition. In particular, the superposition of $\mathsf{p}(x, x)$ into $D$'s negative literal eventually needs to be performed regardless of the chosen selection function or term order, with the conclusion $E_1 = \mathsf{p}(z_1, z_2) \lor \mathsf{p}(z_2, z_1)$. Then, superposition of $E_1$ into $D$ yields $E_2 = \mathsf{p}(z_1, z_2) \lor \mathsf{p}(z_2, z_3) \lor \mathsf{p}(z_3, z_1)$. Repeating this process yields infinitely many clauses $E_i = \mathsf{p}(z_1, z_2) \lor \cdots \lor \mathsf{p}(z_i, z_{i+1}) \lor \mathsf{p}(z_{i+1}, z_1)$ that cannot be eliminated using standard redundancy-based techniques.

In the example above, the clause *D* is blocked by its second or third literal. If we delete *D*, *C* becomes blocked in turn. Deleting *C* leaves us with the empty set, which is vacuously satisfiable. The example suggests that using BCE during saturation might help focus the proof search. Indeed, Kiesl et al. ended their investigations by asking whether BCE can be used as an inprocessing technique in a saturation prover. Unfortunately, in general the answer is no.

*Example 23:* Consider the unsatisfiable set $N = \{C_1, \ldots, C_6\}$, where

$$\begin{array}{lll} C_1 = \neg\mathsf{c} \lor \mathsf{e} \lor \neg\mathsf{a} & C_2 = \neg\mathsf{c} \lor \neg\mathsf{e} & C_3 = \mathsf{b} \lor \mathsf{c} \\ C_4 = \neg\mathsf{b} \lor \neg\mathsf{c} & C_5 = \mathsf{a} \lor \mathsf{b} & C_6 = \mathsf{c} \lor \neg\mathsf{b} \end{array}$$

Assume the simplification ordering a ≺ b ≺ c ≺ d ≺ e and the selection function that chooses the last negative literal of a clause as presented. Gray boxes indicate literals that can take part in superposition inferences. Only two superposition inferences are possible: from $C_3$ into $C_4$, yielding the tautology $C_7 = \mathsf{b} \lor \neg\mathsf{b}$, and from $C_5$ into $C_6$, yielding $C_8 = \mathsf{a} \lor \mathsf{c}$. Clause $C_7$ is clearly redundant, whereas $C_8$ is blocked by its first literal. If we allow removing blocked clauses, the prover enters a loop: $C_8$ is repeatedly generated and deleted. Thus, the prover will never generate the empty clause for this unsatisfiable set.
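Because Example 23 is propositional, its two key claims can be checked mechanically. The following is a minimal sketch, not the prover's implementation: literals are plain strings with a leading `-` for negation, blockedness is the propositional special case of Definition 21, and all function names are hypothetical.

```python
from itertools import product

def neg(lit):
    return lit[1:] if lit.startswith("-") else "-" + lit

def is_tautology(clause):
    return any(neg(lit) in clause for lit in clause)

def blocking_literal(clause, others):
    """Return a literal on which every resolvent with `others` is a tautology, or None."""
    for lit in sorted(clause):
        partners = [d for d in others if neg(lit) in d]
        if all(is_tautology((clause - {lit}) | (d - {neg(lit)})) for d in partners):
            return lit
    return None

def satisfiable(clauses):
    """Brute-force truth-table check; fine for a handful of atoms."""
    atoms = sorted({lit.lstrip("-") for c in clauses for lit in c})
    for bits in product([False, True], repeat=len(atoms)):
        model = dict(zip(atoms, bits))
        if all(any(model[l.lstrip("-")] != l.startswith("-") for l in c)
               for c in clauses):
            return True
    return False

N = [frozenset(c) for c in (
    {"-c", "e", "-a"}, {"-c", "-e"}, {"b", "c"},
    {"-b", "-c"}, {"a", "b"}, {"c", "-b"},
)]
C8 = frozenset({"a", "c"})

print(satisfiable(N))            # whether N has a propositional model
print(blocking_literal(C8, N))   # literal that blocks C8 in N, if any
```

The only clause containing ¬a is $C_1$, and resolving it with $C_8$ on a produces a clause containing both c and ¬c, which is why $C_8$ is blocked by its first literal.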

As with hidden tautologies, removing blocked clauses breaks the invariant of the given clause procedure that all inferences between clauses in *A* are redundant. To see this, assume the setting of Example 23, and let *P* = *N* and *A* = ∅. Assume $C_1, C_2, C_3$ are moved to the active set. As there are no possible inferences between them, the proof state becomes $A = \{C_1, C_2, C_3\}$ and $P = \{C_4, C_5, C_6\}$. After $C_4$ is moved to *A*, the conclusion $C_7$ is computed, but it is not added to *P* as it is redundant. Moving $C_5$ to *A* produces no new conclusions, but after $C_6$ is moved, $C_8$ is produced. However, if we allow eliminating blocked clauses, it will not be added to *P* as it is blocked. The prover then terminates with *A* = *N* and *P* = ∅, even though the original set *N* is unsatisfiable.

Although using BCE as inprocessing breaks the completeness of superposition in general, it is conceivable that a well-behaved fragment of BCE might exist. This could be investigated further.

Not only can BCE prevent infinite saturation (Example 22), but it can also be used to convert a finite Herbrand model into a certificate of clause set satisfiability. The certificate uses only blocked clause elimination and addition, in conjunction with a transformation to reduce the clause set to an empty set. This theoretical result explores the relationship between Herbrand models and satisfiability certificates based on clause elimination and addition. It is conceivable that it can form the basis of an efficient way to certify Herbrand models.

In propositional logic, *asymmetric literals* can be added to or removed from clauses, retaining the equivalence of the resulting clause set with the original one. Kiesl and Suda [29] described an extension of this technique to first-order logic. Their definition of asymmetric literals can be relaxed to allow the addition of more literals, but the resulting set is then only equisatisfiable to the original one, not equivalent. This in turn allows us to show that a problem is satisfiable by reducing it to an empty problem, as is done in some SAT solvers.

For the rest of this section, we work with clausal first-order logic without equality. We use Herbrand models as canonical representatives of first-order models, recalling that every satisfiable set has a Herbrand model [30, Sect. 5.4].

*Definition 24:* A literal $L$ is a *global asymmetric literal* (GAL) for a clause $C$ and a clause set $N$ if for every ground instance $C\sigma$ of $C$, there exists a ground instance $D\varrho \lor L'\varrho$ of $D \lor L' \in N \setminus \{C\}$ such that $D\varrho \subseteq C\sigma$ and $\neg L'\varrho = L\sigma$.

Every asymmetric literal is a GAL, but the converse does not hold:

*Example 25:* Consider a clause $C = \mathsf{p}(x, y)$ and a clause set $N = \{\mathsf{q} \lor \mathsf{p}(\mathsf{a}, \mathsf{a})\}$. Then, ¬q is not an asymmetric literal for *C* and *N*, but it is a GAL for *C* and *N*.

Adding and removing GALs preserves and reflects satisfiability:

*Theorem 26:* If *L* is a GAL for the clause *C* and the clause set *N*, then the set (*N* \ {*C*}) ∪ {*C* ∨ *L*} is satisfiable if and only if *N* is satisfiable.

For first-order logic without equality, a clause $L \lor C$ is blocked if all its $L$-resolvents are tautologies [16]. The $L$-resolvent between $L \lor C$ and $\neg L_1 \lor \cdots \lor \neg L_n \lor D$ is $(C \lor D)\sigma$, where $\sigma$ is the most general unifier of the literals $L, L_1, \ldots, L_n$ [21]. Given a Herbrand model J of a problem, the following procedure removes all clauses while preserving satisfiability:


resolvents are tautologies. Thus, each q ∨ *L* is blocked and can be removed in turn.

4) The remaining clauses all contain the literal ¬q. They can be removed by BCE as well.

The procedure is limited to first-order logic without equality, since step 3 is justified only if *L* is a predicate literal. (Otherwise, *L* cannot block clause q ∨ *L* [16].) The procedure also terminates only for finite Herbrand models.

*Example 27:* Consider the satisfiable clause set $N = \{\mathsf{r}(x) \lor \mathsf{s}(x),\ \neg\mathsf{r}(\mathsf{a}),\ \neg\mathsf{s}(\mathsf{b})\}$ and a Herbrand model J over {a, b, r, s} such that r(b) and s(a) are the only true atoms in J. We show how to remove all clauses in *N* using J by following the procedure above.

Let $N_{\mathrm{J}} = \{\mathsf{q} \lor \neg\mathsf{r}(\mathsf{a}),\ \mathsf{q} \lor \mathsf{r}(\mathsf{b}),\ \mathsf{q} \lor \mathsf{s}(\mathsf{a}),\ \mathsf{q} \lor \neg\mathsf{s}(\mathsf{b})\}$. We set $N \leftarrow N \cup N_{\mathrm{J}}$. This preserves satisfiability since all clauses in $N_{\mathrm{J}}$ are blocked. It is easy to check that ¬q is a GAL for every clause in $N \setminus N_{\mathrm{J}}$. The only substitutions that need to be considered are $\{x \mapsto \mathsf{a}\}$ and $\{x \mapsto \mathsf{b}\}$ for r(*x*) ∨ s(*x*). So we set $N \leftarrow \{\neg\mathsf{q} \lor \mathsf{r}(x) \lor \mathsf{s}(x),\ \neg\mathsf{q} \lor \neg\mathsf{r}(\mathsf{a}),\ \neg\mathsf{q} \lor \neg\mathsf{s}(\mathsf{b})\} \cup N_{\mathrm{J}}$. Clearly, all clauses in $N_{\mathrm{J}}$ are blocked, so we set $N \leftarrow N \setminus N_{\mathrm{J}}$. All clauses remaining in $N$ have a literal ¬q and can be removed, leaving $N$ empty as desired.
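The "easy to check" GAL step of Example 27 can be made concrete on ground instances. This is a sketch under simplifying assumptions: the clause r(*x*) ∨ s(*x*) is replaced by its two ground instances over {a, b}, witnesses are searched only in the added clause set, literals are strings with a leading `-` for negation, and the helper name `is_ground_gal` is hypothetical.

```python
def neg(lit):
    return lit[1:] if lit.startswith("-") else "-" + lit

def is_ground_gal(lit, clause, others):
    """Ground reading of Definition 24: some D v L' in `others` has
    D a subset of `clause` and the complement of L' equal to `lit`."""
    return any(neg(lit) in d and (d - {neg(lit)}) <= clause for d in others)

# Ground instances of the original clauses of Example 27.
ground_N = [frozenset(c) for c in (
    {"r(a)", "s(a)"}, {"r(b)", "s(b)"}, {"-r(a)"}, {"-s(b)"},
)]
# One clause q v L for each literal L that is true in the Herbrand model.
N_J = [frozenset(c) for c in (
    {"q", "-r(a)"}, {"q", "r(b)"}, {"q", "s(a)"}, {"q", "-s(b)"},
)]

for clause in ground_N:
    print(sorted(clause), is_ground_gal("-q", clause, N_J))
```

Every ground instance contains at least one literal that is true in the model, and the matching clause q ∨ *L* supplies the witness required by the definition.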

#### VI. IMPLEMENTATION

Hidden-literal-based, predicate, and blocked clause elimination all admit efficient implementations in a superposition prover. In this section, we describe how to implement the first two sets of techniques. For BCE, we refer to Kiesl et al. [16]. All techniques are implemented in the Zipperposition prover [31]. Zipperposition is designed for fast prototyping of improvements to superposition, but it implements many of the most successful heuristics from the E prover [32] and has recently become quite competitive [33].

#### *A. Hidden-Literal-Based Elimination*

For HLBE, an efficient representation of HL(*L*, *N*) is crucial. Because this set may be infinite, we underapproximate it by restricting the length of the transitive chains via a parameter $K_{\mathrm{len}}$. Given the current clause set $N$, the finite map $\mathit{Imp}[L']$ associates with each literal $L'$ a set of pairs $(L, M)$ such that $L' \hookrightarrow^{k} L$, where $k \le K_{\mathrm{len}}$ and $M$ is the multiset of clauses used to derive $L' \hookrightarrow^{k} L$. Moreover, we consider only transitions of type (1) (as per Definition 4). The following algorithm maintains *Imp* dynamically, updating it as the prover derives and deletes clauses. It depends on the global variable *Imp* and the parameters $K_{\mathrm{len}}$ and $K_{\mathrm{imp}}$.

     1  procedure ADDIMPLICATION(L_a, L_c, C)
     2    if Imp[L_a σ] ≠ ∅ for some renaming σ then
     3      (L_a, L_c) ← (L_a σ, L_c σ)
     4    if there are no L, L′, M, σ such that (L′, M) ∈ Imp[L],
     5       Lσ = L_a, and L′σ = L_c then
     6      for all (σ, M) such that (L_c σ, M) ∈ Imp[L_a σ] do
     7        erase all (L′, M′) such that M ⊆ M′ from Imp[L_a σ]
     8      for all L such that (L′, M) ∈ Imp[L]
     9          and L_a σ = L′ for some σ do
    10        if |M| < K_len then
    11          Imp[L] ← Imp[L] ∪ {(L_c σ, M ⊎ {C})}
    12      for all L such that Imp[L] ≠ ∅
    13          and Lσ = L_c for some σ do
    14        Concl ← {(L′σ, M ⊎ {C}) |
    15                 (L′, M) ∈ Imp[L], |M| < K_len}
    16        Imp[L_a] ← Imp[L_a] ∪ Concl
    17      Congr ← {(s ≉ t, {C}) | ∃u. L_c = u[s] ≉ u[t]}
    18      Imp[L_a] ← Imp[L_a] ∪ {(L_c, {C})} ∪ Congr
    19  procedure TRACKCLAUSE(C)
    20    if C = L_1 ∨ L_2 then
    21      ADDIMPLICATION(¬L_1, L_2, C)
    22      ADDIMPLICATION(¬L_2, L_1, C)
    23      if L_2 = ¬L_1 σ for some nonidempotent σ then
    24        for all i ← 1 to K_imp do
    25          L_2 ← L_2 σ
    26          ADDIMPLICATION(¬L_1, L_2, C)
    27  procedure UNTRACKCLAUSE(C)
    28    for all L_a, L_c, M such that (L_c, M) ∈ Imp[L_a] do
    29      if C ∈ M then
    30        erase (L_c, M) from Imp[L_a]

The algorithm views a clause $L \lor L'$ as two implications $\neg L \longrightarrow L'$ and $\neg L' \longrightarrow L$. It stores only one entry for all literals equal up to variable renaming (line 2). Each implication $L_{\mathrm{a}} \longrightarrow L_{\mathrm{c}}$ represented by the clause is stored only if its generalization is not present in *Imp* (line 4). Conversely, all instances of the implication are removed (line 6).

Next, the algorithm finds each implication stored in *Imp* that can be linked to $L_{\mathrm{a}} \longrightarrow L_{\mathrm{c}}$: Either $L_{\mathrm{c}}$ becomes the new consequent (line 9) or $L_{\mathrm{a}}$ becomes the new antecedent (line 13). If $L_{\mathrm{c}}$ can be decomposed into $u[s] \not\approx u[t]$, rule (3) of Definition 4 allows us to store $s \not\approx t$ in $\mathit{Imp}[L_{\mathrm{a}}]$ (line 18). This is an exception to the idea that transitive chains should use only rule (1). The application of rule (3) does not count toward the bound $K_{\mathrm{len}}$. If $L_{\mathrm{a}}$ is of the form $u[s] \approx u[t]$, then *Imp* could be extended so that $\mathit{Imp}[s \approx t] = \mathit{Imp}[L_{\mathrm{a}}]$, but this would substantially increase *Imp*'s memory footprint.

In first-order logic, different instances of the same clause can be used along a transitive chain. For example, the clause $C = \neg\mathsf{p}(x) \lor \mathsf{p}(\mathsf{f}(x))$ induces $\mathsf{p}(x) \hookrightarrow^{i} \mathsf{p}(\mathsf{f}^{i}(x))$ for all $i$. The algorithm discovers such self-implications (line 23): For each clause $C$ of the form $\neg L \lor L\sigma$, where $\sigma$ is some nonidempotent substitution, the entries $(L\sigma^{2}, \{C\}), \ldots, (L\sigma^{K_{\mathrm{imp}}+1}, \{C\})$ are added to $\mathit{Imp}[L]$, where $K_{\mathrm{imp}}$ is a parameter.
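The iteration of a nonidempotent substitution can be illustrated with a few lines of code. This is a sketch only, with an assumed term representation (variables as strings, compound terms and atoms as tuples headed by their symbol) and hypothetical function names; it computes the literals of the extra entries and ignores the clause multisets.

```python
def substitute(term, sigma):
    """Apply a substitution; variables are strings, compound terms are tuples."""
    if isinstance(term, str):
        return sigma.get(term, term)
    head, *args = term
    return (head, *(substitute(arg, sigma) for arg in args))

def self_implication_literals(lit, sigma, k_imp):
    """Return L sigma^2, ..., L sigma^(k_imp + 1), in that order."""
    current = substitute(lit, sigma)   # L sigma, covered by the ordinary entry
    result = []
    for _ in range(k_imp):
        current = substitute(current, sigma)
        result.append(current)
    return result

L = ("p", "x")
sigma = {"x": ("f", "x")}              # nonidempotent: applying it twice differs
for entry in self_implication_literals(L, sigma, k_imp=2):
    print(entry)
```

For the clause ¬p(*x*) ∨ p(f(*x*)) and a bound of 2, the loop produces p(f(f(*x*))) and p(f(f(f(*x*)))), matching the range of exponents given in the text.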

To track and untrack clauses efficiently, we implement the mapping *Imp* as a nonperfect discrimination tree [34]. Given a query literal $L$, this indexing data structure efficiently finds all literals $L'$ such that for some $\sigma$, $L'\sigma = L$ and $\mathit{Imp}[L'] \neq \emptyset$. We can use it to optimize all lookups except the one on line 9. For this remaining lookup, we add an index $\mathit{Imp}^{-1}$ that inverts *Imp*, i.e., $\mathit{Imp}^{-1}[L] = \{L' \mid (L, M) \in \mathit{Imp}[L'] \text{ for some } M\}$. To avoid sequentially going through all entries in *Imp* when the prover deletes them, for each clause $C$ we keep track of each literal $L$ such that $C$ appears in $\mathit{Imp}[L]$. Finally, we limit the number of entries stored in $\mathit{Imp}[L]$: by default, up to 48 pairs in each $\mathit{Imp}[L]$ are stored.

Rules HLE and HTR have a simple implementation based on *Imp* lookups. To implement UNITHLE and UNITHTR, we maintain the index *Unit*, containing literals $L_{\mathrm{c}}\sigma$ such that $(L_{\mathrm{c}}, M) \in \mathit{Imp}[L_{\mathrm{a}}]$ for some $M$ and $L_{\mathrm{a}}$, where $\sigma$ is the most general unifier of $L'$ and $L_{\mathrm{a}}$, for some unit clause $\{L'\}$. The implementation of FLE and FLR also uses *Unit*: When $(L', M)$ is added to $\mathit{Imp}[L]$, we check if $(\neg L', M') \in \mathit{Imp}[L]$ for some $M'$. If so, $\neg L$ is added to *Unit*.

In propositional logic, the conventional approach constructs the *binary implication graph* for the clause set $N$ [4], with edges $(\neg L, L')$ and $(\neg L', L)$ whenever $L \lor L' \in N$. To avoid traversing the graph repeatedly, solvers rely on timestamps to discover connections between literals. This relies on syntactic literal comparisons, which is very fast in propositional logic but not in first-order logic, because of substitutions and congruence.
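For comparison, the propositional binary implication graph and a length-bounded reachability query take only a few lines. This is a simplified sketch under stated assumptions: it uses a plain breadth-first search rather than the timestamp technique, the bound `k_len` mimics the chain-length limit used for *Imp*, and all names are hypothetical.

```python
from collections import deque

def neg(lit):
    return lit[1:] if lit.startswith("-") else "-" + lit

def implication_graph(clauses):
    """Each binary clause L v L' contributes the edges (-L, L') and (-L', L)."""
    graph = {}
    for clause in clauses:
        if len(clause) == 2:
            a, b = clause
            graph.setdefault(neg(a), set()).add(b)
            graph.setdefault(neg(b), set()).add(a)
    return graph

def implied_within(graph, start, k_len):
    """Map each literal reachable from `start` in at most k_len edges to its distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        lit = queue.popleft()
        if distance[lit] == k_len:
            continue                    # do not extend chains past the bound
        for successor in graph.get(lit, ()):
            if successor not in distance:
                distance[successor] = distance[lit] + 1
                queue.append(successor)
    del distance[start]
    return distance

clauses = [("-a", "b"), ("-b", "c"), ("-c", "d")]
graph = implication_graph(clauses)
print(implied_within(graph, "a", k_len=2))
```

Here node identity is plain string equality, which is exactly the cheap syntactic comparison that is unavailable once literals must be matched modulo substitutions.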

#### *B. Predicate Elimination*

To implement portfolio predicate elimination, we maintain a record for each predicate symbol p occurring in the problem with the following fields: set of definition clauses for p, set of nondefinition clauses in which p occurs once, and set of clauses in which p occurs more than once. These records are kept in a priority queue, prioritized by properties such as presence of definition sets and number of estimated resolutions. If p is the highest-priority symbol that is eligible for SPE or DPE, we eliminate it by removing all the clauses stored in p's record from the proof state and by adding flat resolvents to the passive set. Eliminating a symbol might make another symbol eligible.

As an optimization, predicate elimination keeps track only of symbols that appear at most $K_{\mathrm{occ}}$ times in the clause set. For inprocessing, we use signals that the prover emits whenever a clause is added to or removed from the proof state and update the records. At the beginning of the 1st, $(K_{\mathrm{iter}} + 1)$st, $(2K_{\mathrm{iter}} + 1)$st, ... iteration of the given clause procedure's loop body, predicate elimination is systematically applied to the entire proof state. The first application of inprocessing amounts to preprocessing. By default, $K_{\mathrm{occ}} = 512$ and $K_{\mathrm{iter}} = 10$. The same ideas and limits apply for blocked clause elimination.
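The schedule and the occurrence limit described above amount to two small predicates. The sketch below is only an illustration with hypothetical names; the two constants are the defaults quoted in the text, and iterations are assumed to be counted from 1.

```python
K_ITER = 10   # default period, as reported in the text
K_OCC = 512   # default occurrence limit, as reported in the text

def runs_predicate_elimination(iteration, k_iter=K_ITER):
    """True for the 1st, (k_iter + 1)st, (2 * k_iter + 1)st, ... iteration (1-based)."""
    return (iteration - 1) % k_iter == 0

def is_tracked(occurrences, k_occ=K_OCC):
    """Only symbols occurring at most k_occ times keep a record."""
    return occurrences <= k_occ

print([i for i in range(1, 32) if runs_predicate_elimination(i)])
```

Since iteration 1 always qualifies, the first inprocessing round coincides with preprocessing.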

The most important novel aspect of our predicate elimination implementation is recognizing the definition clauses for symbol p in a clause set *N*, which is performed as follows:


core of *E* and *G*′ the set of corresponding first-order clauses and continue.

4) If all resolvents in $G'_{\mathsf{p}} \rtimes_{\mathsf{p}} G'_{\neg\mathsf{p}}$ are tautologies, then $G'$ is the definition set for symbol p. Else, report failure.

The invalidity of set *E* from step 3 is checked using a SAT solver, which is already integrated in Zipperposition. As modern theorem provers (such as E or Vampire) also use SAT solvers, the method can easily be implemented.

During experimentation, we noticed that recognizing definitions of symbols that occur in the conjecture often harms performance. Thus, Zipperposition recognizes definitions only for the remaining symbols.

#### VII. EVALUATION

We measure the impact of our elimination techniques for various values of their parameters. As a baseline, we use Zipperposition's first-order portfolio mode, which runs the prover in 13 configurations of heuristic parameters in consecutive time slices. None of these configurations use our new techniques. To evaluate a given parameter value, we fix it across all 13 configurations and compare the results with the baseline.

The benchmark set consists of all 13 495 CNF and FOF TPTP 7.3.0 theorems [17]. The experiments were carried out on StarExec servers [35] equipped with Intel Xeon E5-2609 CPUs clocked at 2.40 GHz. The portfolio mode uses a single CPU core with a CPU time limit of 180 s. The base configuration solves 7897 problems. The values in the tables indicate the number of problems solved minus 7897. Thus, positive numbers indicate gains over the baseline. The best result is shown in bold.

#### *A. Hidden-Literal-Based Elimination*

The first experiments use all implemented HLBE rules. To avoid overburdening Zipperposition, we can enable an option to limit the number of tracked clauses for hidden literals. Once the limit has been reached, any request for tracking a clause will be rejected until a tracked clause is deleted. We can choose which kind of clauses are tracked: only clauses from the active set *A*, only clauses from the passive set *P*, or both. We also vary the maximal implication chain length $K_{\mathrm{len}}$ and the number of computed self-implications $K_{\mathrm{imp}}$.

In Zipperposition, every lookup for instances or generalizations of *s* ≈ *t* must be done once for each orientation of the equation. To avoid this inefficiency, and also because the implementation of hidden literals does not fully exploit congruence, we can disable tracking clauses with at least one functional literal. Clauses containing functional literals can then still be simplified.

Figures 1 and 2 show the results, without and with functional literal tracking enabled, for $K_{\mathrm{len}} = 2$ and $K_{\mathrm{imp}} = 0$. The columns specify different limits on the number of tracked clauses, with ∞ denoting that no limit is imposed. The rows represent different kinds of tracked clauses. The results suggest that tracking functional literals is not worth the effort but that tracking predicate literals is. The best improvement is observed when both active and passive clauses are tracked. Normally DISCOUNT-loop provers [26] such as Zipperposition do not simplify active clauses using passive clauses, but here we see that this can be effective. Figure 3 shows the impact of varying $K_{\mathrm{len}}$ and $K_{\mathrm{imp}}$, when 500 clauses from the entire proof state are tracked. These results suggest that computing long implication chains is counterproductive.

Fig. 1. Impact of the number and kinds of tracked clauses on HLBE performance, when only predicate literals are tracked

Fig. 2. Impact of the number and kinds of tracked clauses on HLBE performance, when all literals are tracked

#### *B. Predicate and Blocked Clause Elimination*

For defined predicate elimination, the number of resolvents grows exponentially with the number of occurrences of p. To avoid this expensive computation, we limit the applicability of PPE to proof states for which p is singular. According to our informal experiments, full PPE, without this restriction, generally performs less well.

Predicate elimination can be done using Khasidashvili and Korovin's criterion (K&K) or using our relaxed criterion with different values of $K_{\mathrm{tol}}$. Figure 4 shows the results for SPE and PPE used as preprocessors. Our numbers corroborate Khasidashvili and Korovin's findings: SPE with K&K proves 70 more problems than the base, a 0.9% increase, comparable to the 1.8% they observe when they combine SPE with additional preprocessing. Remarkably, the number of additional proved problems more than doubles when we use our criterion with $K_{\mathrm{tol}} > 0$, for both SPE and PPE.
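The quoted percentage is relative to the 7897 problems solved by the base configuration, as the following arithmetic confirms:

$$\frac{70}{7897} \approx 0.0089 \approx 0.9\%.$$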

Although this is not evident in Figure 4, varying $K_{\mathrm{tol}}$ substantially changes the set of problems solved. For example, when $K_{\mathrm{tol}} = 0$, SPE proves 60 theorems not proved using $K_{\mathrm{tol}} = 50$. The effect weakens as $K_{\mathrm{tol}}$ grows. When $K_{\mathrm{tol}} = 100$, SPE proves only 13 problems not found when $K_{\mathrm{tol}} = 200$. Similarly, the set of problems proved by SPE and PPE differs: When $K_{\mathrm{tol}} = 25$, 14 problems are proved by PPE but missed by SPE. Recognizing definition sets is useful: PPE outperforms SPE regardless of the criterion.

Performing BCE and variable elimination until fixpoint increases the performance of SAT solvers [14]. We can check whether the same holds for superposition provers. In this experiment, we use the relaxed criterion with $K_{\mathrm{tol}} = 25$ and HLBE that tracks up to 500 clauses from any clause set, with $K_{\mathrm{len}} = 2$ and $K_{\mathrm{imp}} = 0$. We use each technique as preprocessing and inprocessing.


Fig. 3. Impact of the parameters *K*len and *K*imp on HLBE performance


Fig. 4. Impact of the choice of criterion on predicate elimination performance

The results are summarized in Figure 5, where the + sign denotes the combination of techniques. We confirm the results obtained by Kiesl et al. about the performance of BCE as preprocessing: It helps prove 30 more problems from our benchmark set, increasing the success rate by roughly 0.4%. The same percentage increase was obtained by Kiesl et al. Using BCE as inprocessing, however, hurts performance, presumably because of its incompatibility with the redundancy criterion.

For preprocessing, the combinations SPE+BCE and PPE+BCE performed roughly on a par with SPE and PPE, respectively. This stands in contrast to the situation with SAT solvers, where such a combination usually helps. It is also worth noting that the inprocessing techniques never outperform their preprocessing counterparts. The last column shows that combining HLBE with other elimination techniques overburdens the prover.

#### *C. Satisfiability by Blocked Clause Elimination*

Kiesl et al. found that blocked clause elimination is especially effective on satisfiable problems. To corroborate their results and ascertain whether a combination of predicate elimination and blocked clause elimination increases the success rate, we evaluate BCE on all 2273 satisfiable TPTP FOF and CNF problems. The hardware and CPU time limits are the same as in the experiments above. Figure 6 presents the results.

The baseline establishes the satisfiability of 856 problems. We consider only preprocessing techniques, since BCE compromises refutational completeness: a saturation does not guarantee that the original problem was satisfiable. We note that recognizing definition sets makes almost no difference on satisfiable problems. The sets of problems solved by BCE and PPE differ: 30 problems are solved by BCE and not by PPE.

#### VIII. CONCLUSION

We adapted several preprocessing and inprocessing elimination techniques implemented in modern SAT solvers so that they work in a superposition prover. This involved lifting the techniques to first-order logic with equality but also tailoring them to work in tandem with superposition and its redundancy criterion. Although SAT solvers and superposition provers embody radically different philosophies, we found that the lifted SAT techniques provide valuable optimizations.


Fig. 5. Performance of predicate and blocked clause elimination


Fig. 6. Performance of predicate and blocked clause elimination for establishing satisfiability

We see several avenues for future work. First, the implementation of hidden literals could be extended to exploit equality congruence. Second, although inprocessing blocked clause elimination is incomplete in general, we hope to achieve refutational completeness for a substantial fragment of it. Third, predicate and blocked clause elimination, which thrive on the absence of clauses from the proof state, could be enhanced by tagging and ignoring generated clauses that have not yet been used to subsume or simplify untagged clauses. Fourth, predicate and blocked clause elimination could be extended to work with functional literals. Fifth, more SAT techniques could be adapted, including bounded variable addition [36] and blocked clause addition [37]. Sixth, the techniques we covered could be adapted to work with other first-order calculi, or generalized further to work with higher-order calculi such as combinatory superposition [38] and λ-superposition [39].

#### *A. Acknowledgment*

We are grateful to the maintainers of StarExec for letting us use their service. Uwe Waldmann participated in the search for a counterexample to completeness of BCE as inprocessing and confirmed that Example 23 is correct. He also suggested major simplifications and helped us debug the proofs of the claims about predicate elimination. Anne Baanen helped us define the nonclassical logic used to disallow splitting. Ahmed Bhayat, Armin Biere, Mathias Fleury, Benjamin Kiesl, and the anonymous reviewers made some useful comments on our manuscript, and Mark Summerfield suggested many textual improvements. We thank them all.

Vukmirović and Blanchette's research has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 713999, Matryoshka). Blanchette's research has also received funding from the Netherlands Organization for Scientific Research (NWO) under the Vidi program (project No. 016.Vidi.189.037, Lean Forward). Heule is supported by the National Science Foundation (NSF) under grant CCF-2015445.

#### REFERENCES


# SAT Solving in the Serverless Cloud

Alex Ozdemir§ , Haoze Wu§ , and Clark Barrett

Stanford University, USA.

{aozdemir, haozewu, barrett}@cs.stanford.edu

*Abstract*—In recent years, cloud service providers have sold computation in increasingly granular units. Most recently, "serverless" executors run a single executable with restricted network access and for a limited time. The benefit of these restrictions is scale: thousand-way parallelism can be allocated in seconds, and CPU time is billed with sub-second granularity. To exploit these executors, we introduce **gg-SAT**: an implementation of divide-and-conquer SAT solving. Infrastructurally, **gg-SAT** departs substantially from previous implementations: rather than handling process or server management itself, **gg-SAT** builds on the **gg** framework, allowing computations to be executed on a configurable backend, including serverless offerings such as AWS Lambda. Our experiments suggest that when run on the same hardware, **gg-SAT** performs competitively with other D&C solvers, and that the 1000-way parallelism it offers (through AWS Lambda) is useful for some challenging SAT instances.

*Index Terms*—parallel SAT, serverless computing, divide and conquer.

### I. INTRODUCTION

Modern Boolean satisfiability (SAT) solvers have been successfully applied to important practical and theoretical domains, such as hardware verification, planning, and mathematics. Progress in the scalability of these tools has come from both algorithmic improvements and better leveraging of multi-processing hardware. While the number of processors on a single machine is limited, and maintaining a warm cluster to run occasional tasks is expensive, cloud computing is a promising approach for leveraging on-demand parallelism at low cost.

Recent cloud-computing services are offered at increasingly fine granularity and low latency. Instead of renting a server or a cluster, one can now rent state-free executors, which can be rapidly and plentifully provisioned at a low price, a paradigm referred to as *serverless computing*. Serverless executors generally have restricted network access, limited memory, and limited runtime. For example, Amazon's Lambda service rents a Linux container to run arbitrary x86-64 executables for up to 15 minutes, with less than a second of startup time and no charge when idle. Similar services are offered by Google, Microsoft, Alibaba, and IBM. Previous research has used serverless computing as a "burstable supercomputer" for video processing [2], neural network training [25], and more [13]–[15], [33]. These successes raise the question: "can serverless computing be leveraged for massively parallel SAT solving?"

There are two traditional parallel SAT-solving paradigms: 1) the portfolio approach, where each thread runs a different SAT solver on the same instance; and 2) the divide-and-conquer (D&C) approach, where a problem is partitioned into independent sub-problems to be solved in parallel. While the former approach in combination with clause-sharing leads to surprisingly good performance for small portfolio sizes, the benefits decrease as parallel computing power increases, and this approach is also not well aligned with the runtime and communication limitations of serverless executors. In this paper, we follow the second approach and present gg-SAT, a divide-and-conquer (D&C) SAT solver compatible with serverless computing. gg-SAT makes black-box use of a *solver* (e.g., CaDiCaL [8]) and a *divider* (e.g., march [28]) to solve and partition the problems, respectively. Problem division is performed throughout the search, whenever a sub-problem reaches a timeout imposed by either the user or the cloud service. Infrastructurally, gg-SAT differs substantially from previous D&C implementations: rather than handling process or server management itself, gg-SAT builds on top of the gg framework for parallel computation. By expressing D&C search using gg, gg-SAT can execute that search on any mixture of user-specified backends; supported backends currently include local processes, remote machines, and serverless cloud services such as AWS Lambda and Google Cloud Functions. To implement gg-SAT, we designed and built pygg, a novel and idiomatic Python interface to gg. We expect that pygg will be independently useful for other future projects, perhaps including parallel SMT solving.

§Equal contribution

We evaluate gg-SAT using local processes and AWS Lambda as backends. Local experiments suggest that gg-SAT performs competitively with the original Cube-and-Conquer prototype [19], a recent reimplementation of it [18], and a portfolio solver PLingeling [7], on benchmarks taken from [18], [19]. Cloud experiments suggest that gg-SAT unlocks levels of parallelism which are useful for solving some challenging instances from the 2020 SAT Competition.

### II. BACKGROUND & RELATED WORK

#### *A. Parallel SAT*

Propositional satisfiability is an old problem; we refer the reader to the Handbook of Satisfiability [9] for an introduction. Parallel SAT solving also has a lengthy history, with two main approaches.

The first approach is *portfolio solving*, pioneered in [16], [22], [34]. In a portfolio solver, each thread runs a different solver or configuration on the same original formula. An instance is solved as quickly as the best individual solver for that instance. Portfolio solvers include: ManySAT [17], CryptoMinisat [32], PLingeling [7], Syrup [3], HordSAT [6], and Painless [26]. Some portfolio solvers also use *clause sharing* [11], [31]: sharing learnt clauses among the different solvers.

Another approach to parallelizing SAT is *divide-and-conquer* (D&C). D&C solvers attempt to divide a SAT instance into easier SAT instances, which can then be solved in parallel by a base solver. Typically, D&C solvers divide instances by partitioning the search space. The important questions (how and when to divide) are answered heuristically, typically with heuristics derived from look-ahead solvers and CDCL solvers. There has been substantial work on D&C SAT solving [10], [23], [24], including: Psato [35], Painless [27], and AMPHAROS [29]. One prominent approach, "cube-and-conquer" [19], uses a lookahead solver to divide instances and a CDCL solver to solve sub-problems; this approach has been successful for large mathematical problems [21].

#### *B. Distributed SAT*

A number of systems attempt parallel SAT solving using a cluster of computers, possibly rented from the cloud. Most of these systems (Qsat [30], HordSAT [6], TopoSAT [12], SLIME [20]) follow the portfolio approach. One recent system (Paracooba [18]) follows the D&C approach. All of these systems operate in the "cluster" computational model, in which long-running processes on each node communicate over the network.

#### *C. Serverless Computing*

Cloud service providers, such as Microsoft Azure, rent out computational resources including compute, storage, and accelerators. Over the past decade, service providers have rented compute with increasing granularity, scale, and availability. Their recent offerings include *serverless* services, which run a single executable for a limited time, with limited memory and restricted network access. While restricted, serverless computing has strengths: it offers massive parallelism that can be rapidly provisioned, with fine-grained billing. For example, AWS Lambda [4] runs executables for up to 15 minutes, with 3GB of memory and 500MB of disk space; the runs are billed at sub-second granularity, and a thousand executors can be provisioned in seconds.

While serverless computing was designed for operational convenience, recent work has explored using it as a "burstable supercomputer-on-demand" [13], for tasks such as video processing [2], ray tracing [14], and machine learning [25]. One system, gg [13], provides a general framework for leveraging minimal executors (including serverless ones). It uses a configurable backend (such as a local machine, remote machines, or serverless executors) to evaluate a programmer-defined dependency graph of *thunks*: programs that take files as inputs. Thunks can output files or new thunks; the latter causes the dependency graph to grow dynamically. Dynamic dependency graphs can express many applications; gg has been used for tasks such as neural network verification [33], compilation [13], and video encoding [15].

(a) The D&C search tree. ϕ's solve query times out and is split into three sub-problems, one of which has been solved.

(b) The gg dependency graph. Dashed arrows denote dependencies; if a node produces multiple outputs, the dependency edges are labelled. The solid arrow denotes a thunk that returns another thunk. Shaded thunks have been evaluated.

Fig. 1: A D&C search snapshot and its corresponding dependency graph. In both diagrams, S, M, and D denote solve, merge, and divide, respectively.

#### III. DESCRIPTION

#### *A. Algorithm*

gg-SAT uses a D&C algorithm with multiplicatively growing timeouts. It is parameterized by a *base solver* and a *divider*. The base solver can be any SAT solver. The divider's job is to partition a problem into a requested number of sub-problems such that the disjunction of the sub-problems is equisatisfiable with the original problem. Other parameters to the algorithm include the timeout t, the timeout growth factor f, the number of initial partitions p<sub>i</sub>, and the number of partitions for each sub-problem, p<sub>s</sub>.

Figure 1a illustrates the solving of formula ϕ as a tree, with p<sub>i</sub> = 1 and p<sub>s</sub> = 3. The number of initial divisions is 1, so the base solver first attempts the original problem ϕ with timeout t. This times out, so the divider runs and splits ϕ into sub-problems (ϕ<sub>0</sub>, ϕ<sub>1</sub>, ϕ<sub>2</sub>), each of which is attempted with timeout f · t. The sub-problem ϕ<sub>0</sub> is determined to be UNSAT; the other sub-problems have yet to be solved, and may be divided again. The process ends when all sub-problems are determined to be UNSAT or any sub-problem is determined to be SAT.
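The search loop described above can be sketched in a few lines of Python. This is a minimal sequential illustration under stated assumptions, not gg-SAT's code: the base solver is a brute-force enumerator whose "timeout" is a budget of assignments tried (chosen so the example is deterministic), the divider branches on the lowest-numbered variables instead of using look-ahead and therefore returns a power-of-two number of cubes that is at least the requested count, and initial partitions are assumed to start with timeout t. The names `simplify`, `solve`, `divide`, and `dnc` are hypothetical.

```python
from itertools import product

def simplify(cnf, cube):
    """Assert the literals of `cube` in `cnf`; return None if a clause is falsified."""
    true_lits = set(cube)
    out = []
    for clause in cnf:
        if any(lit in true_lits for lit in clause):
            continue                      # clause already satisfied
        rest = [lit for lit in clause if -lit not in true_lits]
        if not rest:
            return None                   # every literal is false
        out.append(rest)
    return out

def free_vars(cnf):
    return sorted({abs(lit) for clause in cnf for lit in clause})

def solve(cnf, budget):
    """Stand-in base solver: brute force, giving up after `budget` assignments."""
    variables = free_vars(cnf)
    for tried, bits in enumerate(product([False, True], repeat=len(variables))):
        if tried >= budget:
            return "TIMEOUT"
        value = dict(zip(variables, bits))
        if all(any(value[abs(l)] == (l > 0) for l in c) for c in cnf):
            return "SAT"
    return "UNSAT"

def divide(cnf, parts):
    """Stand-in divider: branch on the first k variables, with 2**k >= parts."""
    variables = free_vars(cnf)
    k = 0
    while 2 ** k < parts and k < len(variables):
        k += 1
    return [[v if b else -v for v, b in zip(variables[:k], bits)]
            for bits in product([True, False], repeat=k)]

def dnc(cnf, t, f, p_i, p_s):
    """Return "SAT" or "UNSAT" for `cnf` (a list of clauses of nonzero ints)."""
    cubes = divide(cnf, p_i) if p_i > 1 else [[]]
    work = [(cube, t) for cube in cubes]  # (sub-problem, timeout) pairs
    while work:
        cube, timeout = work.pop()
        sub = simplify(cnf, cube)
        if sub is None:
            continue                      # this sub-problem is UNSAT
        result = solve(sub, timeout)
        if result == "SAT":
            return "SAT"                  # any SAT sub-problem ends the search
        if result == "TIMEOUT":           # re-divide with a larger timeout
            work += [(cube + ext, timeout * f) for ext in divide(sub, p_s)]
    return "UNSAT"                        # every sub-problem was refuted

# Pigeonhole formula with 3 pigeons and 2 holes (unsatisfiable).
PHP = [[1, 2], [3, 4], [5, 6],
       [-1, -3], [-1, -5], [-3, -5],
       [-2, -4], [-2, -6], [-4, -6]]
print(dnc(PHP, t=4, f=1.5, p_i=1, p_s=3))
```

The sketch terminates only when f > 1 or each division strictly shrinks the sub-problem; a real deployment would run the work items in parallel rather than popping them from a list.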

#### *B. Implementation*

To apply D&C to SAT, we must instantiate its primitive notions (sub-problems, solving, and dividing) for SAT. We follow previous work [19], [24] by using a lookahead solver (march) to build sub-problems described by *cubes* (lists of asserted literals) and by using a CDCL solver (CaDiCaL [8]) to attempt to solve problems and sub-problems. march can produce a large number of cubes (e.g., millions) and can take a long time. This was appropriate for cube-and-conquer (which ran march exactly once per problem) but is inappropriate for divide-and-conquer (which runs march many times, seeking a small number of sub-problems each time). To address this, we configure march with a maximum cube length, which substantially reduces its runtime.

Fig. 2: gg-SAT expresses D&C search as a dynamically expanding dependency graph and uses gg to evaluate that graph using a back-end of the user's choice.

Our D&C implementation uses the gg framework for parallel execution [13]. Recall (§II) that using gg requires the computation to be expressed as a dependency graph of thunks, each of which is an individual executable. For D&C, there are three kinds of thunks. *Solve thunks* run the base solver; if it returns a result, the thunk returns that result as well; otherwise, the solve thunk returns a *merge thunk*, which combines the solutions to sub-problems that are produced by a *divide thunk*, which runs the divider. Figure 1 illustrates the relationship between an in-progress D&C search and the gg dependency graph. When D&C attempts to solve S(ϕ, t), the dependency graph contains only the nodes left of the dotted line. However, when that query times out, the corresponding thunk returns 5 new thunks: a divide thunk to create 3 sub-problems, three solve thunks to (attempt to) solve them, and a merge thunk, whose output should be taken as the output of the original S thunk.

By expressing D&C search as a gg dependency graph, we can use gg to execute that search using a back-end (or combination of back-ends) of the user's choice. Figure 2 visualizes the different runtime components of the system. Our driver translates the D&C search tree into a graph. The reductor analyzes this graph, searching for thunks whose dependencies are fully evaluated; it sends these to a configured backend. When an executor returns values or subgraphs, the reductor updates its graph. When the graph is reduced to a single value, the reductor returns that value to the driver. For more details about the execution process, see [13].
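To make the solve, divide, and merge thunks and the role of the reductor concrete, the following toy model may help. It is a sketch under loud assumptions: it does not use gg or pygg, the `Thunk` class and `force` function are hypothetical stand-ins for a thunk and a sequential reductor, a `select` thunk models one labelled output edge of the divide thunk, and the "SAT problem" is replaced by a search for an integer whose square equals `TARGET`, with the timeout modelled as the number of candidates scanned.

```python
P_S, F = 3, 1.5                 # sub-problems per split, timeout growth factor
TARGET = 37 * 37                # toy "formula": find x with x * x == TARGET

class Thunk:
    """A deferred call whose arguments may themselves be thunks."""
    def __init__(self, fn, *args):
        self.fn, self.args = fn, args
        self.done, self.value = False, None

def force(node):
    """Toy sequential reductor: evaluate dependencies first, cache results."""
    if not isinstance(node, Thunk):
        return node
    if not node.done:
        args = [force(a) for a in node.args]
        node.value = force(node.fn(*args))   # a thunk may return a new thunk
        node.done = True
    return node.value

def divide(problem, parts):
    """Divide thunk: one output per sub-problem."""
    return [problem[i::parts] for i in range(parts)]

def select(outputs, i):
    """Models a labelled dependency edge on one output of a divide thunk."""
    return outputs[i]

def merge(*results):
    """Merge thunk: SAT if any sub-problem is SAT, otherwise UNSAT."""
    return "SAT" if "SAT" in results else "UNSAT"

def solve(problem, timeout):
    """Solve thunk: return a result, or grow the graph when the budget runs out."""
    if any(x * x == TARGET for x in problem[:timeout]):
        return "SAT"
    if len(problem) <= timeout:
        return "UNSAT"
    d = Thunk(divide, problem, P_S)
    subs = [Thunk(solve, Thunk(select, d, i), int(timeout * F))
            for i in range(P_S)]
    return Thunk(merge, *subs)               # its value replaces this thunk's value

print(force(Thunk(solve, list(range(100)), 2)))
```

Unlike this recursive evaluator, the reductor described above dispatches every thunk whose dependencies are ready to a backend, so independent solve thunks can run in parallel.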

To ease the development of gg-SAT, we built pygg, a Python library for building dynamic gg dependency graphs. While gg is conceptually simple, using it typically requires programmers to write many different shell scripts for tasks such as embedding values in the gg graph, creating different kinds of thunks, and reformatting files for different solvers. With pygg, the entire computation can be expressed as a single Python script. Different kinds of thunks are just different Python functions, each of which can return basic Python values, one or more files, or the output of some combination of other thunks. With pygg, our D&C implementation fits in a single Python script of less than 200 lines. pygg has been merged upstream into the gg project.

#### IV. EXPERIMENTS

gg-SAT is the first SAT solver targeting serverless computation, so we cannot compare with previous tools on our infrastructure of interest. Nonetheless, we perform two experiments. First, we compare gg-SAT with other multithreaded solvers on a single multicore machine, to validate the general architecture and performance of gg-SAT. Second, we use 1000 serverless executors to attempt unsolved benchmarks from the SAT 2020 competition, showing the utility of the massive parallelism that gg-SAT unlocks.

#### *A. Local experiment*

We compare with the default configurations of four parallel solvers: 1) the original Cube-and-Conquer prototype (denoted CnC)<sup>1</sup> [19]; 2) Paracooba<sup>2</sup> [18], a recent Cube-and-Conquer re-implementation that is optimized for distributed computing; 3) Treengeling<sup>3</sup> [8], a divide-and-conquer SAT solver; and 4) PLingeling [8], a state-of-the-art portfolio SAT solver. We evaluate on the benchmarks reported in [18], [19]. We run gg-SAT with p<sub>i</sub> = 64, p<sub>s</sub> = 4, t = 10, and f = 1.5, a set of parameters empirically determined to work well. For the other four solvers, we use the default parameters except that the number of threads is set to 64. Our testbed machines have two 2.70GHz Xeon Platinum 8280 CPUs, running CentOS 7. Each job is run with a 256 GB memory limit and a 1-hour wall-clock timeout.

Table I shows the solvers' wall-clock runtime for each benchmark. Given the small set of benchmarks, we can draw only limited conclusions. Nonetheless, the results suggest gg-SAT's performance is reasonable. It solves more benchmarks than the other three divide-and-conquer solvers, corroborating past research [1] that interleaving look-ahead with CDCL can be beneficial. It also solves more than PLingeling, suggesting that the divide-and-conquer approach can be preferable to the portfolio approach in some cases. Note, however, that each other solver can solve at least one benchmark that gg-SAT cannot, suggesting that the approaches are complementary.

<sup>1</sup>https://github.com/marijnheule/CnC/tree/ee8f8aab3729b46bc92dc

<sup>2</sup>https://github.com/maximaximal/Paracooba/tree/d905b67304eb780

<sup>3</sup>https://github.com/arminbiere/lingeling/tree/7d5db72420b95ab (same for PLingeling)

TABLE I: Runtime (s) of gg-SAT, CnC, Paracooba, Treengeling, and PLingeling on the benchmarks reported in [18], [19]

#### *B. Serverless experiment*

Our second experiment demonstrates the utility of the thousand-way parallelism that gg-SAT makes convenient. We find that with this parallelism, gg-SAT can solve challenging instances that are out of reach for solvers running at lower levels of parallelism.

We sample 8 instances from the Cloud track of the SAT Competition 2020 [5], none of which were solved during the competition.<sup>4</sup> As summarized in Table II, four of the five solvers from the previous section (using the same configurations) are unable to solve any of these instances within 4 hours. Treengeling solves one instance, Steiner-81-21-bce, in 9331 seconds. However, with gg-SAT running on AWS Lambda with 1000-way parallelism, we determine that three instances (Steiner-81-21-bce, bv-term-small-rw\_350.smt2, and mulhs16.smt2) are UNSAT in 2559, 1455, and 2866 seconds, respectively. For AWS Lambda, we configure gg-SAT with p<sub>i</sub> = 1024, p<sub>s</sub> = 8, t = 10, and f = 1.5.<sup>5</sup>

<sup>4</sup>Steiner-81-21-bce, abw-I-ash85.mtx-w24, ccp-s8-facto4, bv-term-small-rw\_350.smt2, Steiner-405-71-bce, mulhs16.smt2, LED\_round\_29-32\_faultAt\_29\_fault\_injections\_5\_seed\_1579630418, PRESENT\_round\_1-32\_faultAt\_30\_fault\_injections\_10\_seed\_1579630418

<sup>5</sup>Our experiment is incomparable with the results of the 2020 SAT cloud track. The competition environment differs substantially from our testbed; it uses 1600 cores, 20 minutes, and different hardware.

TABLE II: Solver performance on 8 hard instances from the SAT Competition 2020


#### V. DISCUSSION

We have presented gg-SAT, a parallel D&C SAT solver compatible with serverless computing. gg-SAT is built on top of gg, an infrastructure for evaluating parallel computations. gg-SAT appears competitive with other parallel SAT solvers, and easily unlocks ad-hoc large-scale parallelism through execution on serverless cloud services. This massive parallelism appears to be effective in solving some challenging instances. To implement gg-SAT, we also built pygg, a novel Python interface to gg, which we hope will be useful for other applications, such as parallel SMT solving.

*Future Work:* gg-SAT itself could be substantially improved. Currently, its search strategy (e.g., how many subproblems to create, when to re-divide) is independent of the number of idle workers and the number of unsolved problems. This can cause one of two undesirable dynamics: most workers sitting idle while a few tackle challenging sub-problems (that would ideally be immediately divided) or too much time being spent re-dividing (even though all workers are already busy). In the future, we hope to adjust the search strategy depending on the current workload of the system, dividing more when workers are idle, and less when they are not. We suspect that this will improve performance while also reducing the number of parameters for the system.

Other future directions for gg-SAT include proof generation, new dividers, and trying to retain useful clauses from failed base solver attempts.

#### REFERENCES


# Induction with Recursive Definitions in Superposition

Márton Hajdu<sup>∗</sup>, Petra Hozzová<sup>∗</sup>, Laura Kovács<sup>∗</sup> and Andrei Voronkov<sup>†</sup>

<sup>∗</sup>TU Wien <sup>†</sup>University of Manchester and EasyChair

*Abstract*—Functional programs over inductively defined data types, such as lists, binary trees and naturals, can naturally be defined using recursive equations over recursive functions. In first-order logic, function definitions can be considered as universally quantified equalities. Verifying functional program properties therefore requires inductive reasoning with both theories and quantifiers. In this paper we propose new extensions and generalizations to automate induction with recursive functions in saturation-based first-order theorem proving, using the superposition calculus. Instead of using function definitions as first-order axioms, we introduce new simplification rules for treating function definitions as rewrite rules. We guide inductive reasoning and strengthen induction schemata using recursively defined functions. Our experimental results show that handling recursive definitions in superposition reasoning significantly improves automated reasoning with induction.

### I. INTRODUCTION

Automated reasoning has become the backbone of formal software development [1]. Automating inductive reasoning is of increasing importance for emerging applications in software verification, in particular in the context of functional programming and inductive/algebraic data types (also called term algebras), such as natural numbers, lists and binary trees. Functional programs can typically be described by recursive equations/functions over algebraic data types, as illustrated in Figure 1. On the other hand, algebraic data types are, for example, commonly used in security applications to encode uniqueness of hash functions [2] or to express non-interference properties preventing information flow between private/public channels [3]. Formalizing such properties requires full first-order logic with theories, and automating their validation requires inductive reasoning.

Previous works on automating induction mainly focus on inductive theorem proving [4], [5], [6], [7], [8], [9], [10], [11]: deciding when induction should be applied and what induction axiom should be used. Further restrictions are made on the logical expressiveness, for example induction over only universal properties [7], [9], [6], term algebras [12] or Horn clauses [13]. Recent advances related to automating inductive reasoning, such as first-order reasoning with inductively defined data types [14], inductive strengthening of SMT properties [15], structural induction in first-order theorem proving [16], [17], [18], [12], open up new possibilities for automating induction. *In this paper we focus on first-order theorem proving and automate induction by integrating it directly into the proof search algorithm of first-order theorem proving.* The program assertions from lines 17–18 of Figure 1 show what we strive for: validating first-order properties over algebraic data types, such as binary trees, lists and naturals, involving additional recursive function definitions and predicates, such as even, mul, app, flat and aflat. We prove such and similar inductive properties by using saturation-based proof search based on the superposition calculus [19], which is the leading technology in automated theorem proving [20], [21], [22].

Reasoning about inductively defined data types with recursive definitions. Our work targets full and efficient automation of induction with recursive function reasoning, as illustrated by the toy ML-like functional program of Figure 1. Lines 1–3 of Figure 1 declare, respectively, the algebraic data types of natural numbers nat, lists list and binary trees bt, using constructors. In first-order logic, these data types correspond to term algebras [14]. Functional programs over data types can be defined by recursive equations; for example, lines 4–5 of Figure 1 define the addition add of two natural numbers x, y (in first-order logic, function definitions can be considered as universally quantified equalities). Verifying the correctness of Figure 1 then requires proving the formulas of lines 17–18, which assert the equivalence of two functions over binary trees (line 17) and a parity property of naturals (line 18). Automating reasoning about properties of inductively defined data types like nat, list and bt needs to handle acyclicity already for equational properties (which, in general, is not finitely axiomatizable) and induction. Our recent results on reasoning with inductively defined data types and induction [14], [18] enable induction in superposition-based theorem proving, yet only by applying induction over one clause at a time. Our work builds upon these results and brings novel extensions for handling recursive functions and (generalized) induction on arbitrarily many clauses simultaneously.

Our contributions. This paper brings the following contributions.


```
1 datatype nat = zero | s of nat
2 datatype list = nil | cons of nat list
3 datatype bt = leaf | node of bt nat bt
4 add zero y = y
5 add (s x) y = s (add x y)
6 mul zero y = zero
7 mul (s x) y = add (mul x y) y
8 even zero
9 ¬even (s zero)
10 even (s (s x)) ↔ even x
11 app nil z = z
12 app (cons x y) z = cons x (app y z)
13 flat leaf = nil
14 flat (node x y z) = app (flat x) (cons y (flat z))
15 aflat leaf u = u
16 aflat (node x y z) u = aflat x (cons y (aflat z u))
17 assert (∀x, y)(app (flat x) y = aflat x y)
18 assert (∀x, y)(even y → even (mul x y))
```
Fig. 1. Motivating example with recursive definitions over algebraic data types.

tion become inference rules of the saturation process, adding instances of appropriate induction schemata.


Structure of the paper. The rest of the paper is organized as follows. We illustrate the challenges of automating induction with recursive definitions in superposition reasoning in Section II. We present our induction formula generation method in Section IV. Section V describes inductive reasoning with recursive definitions, whereas Section VI generalizes our work to induction with multiple premises. After summarizing our experimental findings in Section VII, we overview related work in Section VIII. We conclude the paper in Section IX.

#### II. MOTIVATING EXAMPLE

We first motivate our work using the functional program of Figure 1 over naturals, lists and binary trees.

Example 1 (Inductive reasoning with lists and binary trees). Using the recursive function definition app over lists, and the recursive function definitions flat and aflat over binary trees (lines 11–16 of Figure 1), we first focus on proving the equivalence of the functions flat and aflat, which flatten binary trees to lists, specified as an assertion at line 17 of Figure 1. To ease readability, we write this assertion as follows:

$$\forall u, v.\ \mathtt{app}(\mathtt{flat}(u), v) = \mathtt{aflat}(u, v) \tag{1}$$

Proving (1) requires induction over binary trees, using for example the structural induction formula

$$\begin{aligned} \Big( F[\mathtt{leaf}] \land \forall x, y, z. \big( (F[x] \land F[z]) \\ \to F[\mathtt{node}(x, y, z)] \big) \Big) \to \forall u. F[u], \end{aligned} \qquad (2)$$

where F[x] denotes a first-order formula over x. By instantiating (2), proving (1) reduces to proving two formulas: the base case and the step case. The base case,

$$\forall v.\ \mathtt{app}(\mathtt{flat}(\mathtt{leaf}), v) = \mathtt{aflat}(\mathtt{leaf}, v),\qquad(3)$$

holds by the recursive definitions at lines 11, 13 and 15 of Figure 1. For the step case, we strengthen the hypotheses by replacing v with fresh universally quantified variables v<sub>0</sub>, v<sub>1</sub>:

$$\begin{aligned} \forall x, y, z, v. \Big( & \forall v_0.\ \mathtt{app}(\mathtt{flat}(x), v_0) = \mathtt{aflat}(x, v_0) \;\land && (4) \\ & \forall v_1.\ \mathtt{app}(\mathtt{flat}(z), v_1) = \mathtt{aflat}(z, v_1) \;\to && (5) \\ & \mathtt{app}(\mathtt{flat}(\mathtt{node}(x,y,z)), v) = \mathtt{aflat}(\mathtt{node}(x,y,z), v) \Big) && (6) \end{aligned}$$

For proving (6), we first use the recursive definitions at lines 14 and 16 of Figure 1 to obtain (omitting (4), (5) and implicit universal quantification):

$$\begin{aligned} \mathtt{app}(\mathtt{app}(\mathtt{flat}(x), \mathtt{cons}(y, \mathtt{flat}(z))), v) &= \\ \mathtt{aflat}(x, \mathtt{cons}(y, \mathtt{aflat}(z, v))) \end{aligned} \quad (7)$$

By rewriting (7) with (4) and (5), we are left with proving:

$$\begin{aligned} \mathtt{app}(\mathtt{app}(\mathtt{flat}(x), \mathtt{cons}(y, \mathtt{flat}(z))), v) &= \\ \mathtt{app}(\mathtt{flat}(x), \mathtt{cons}(y, \mathtt{app}(\mathtt{flat}(z), v))) \end{aligned} \quad (8)$$

By replacing flat(x) with a fresh variable w in (8), we obtain

$$\begin{aligned} \mathtt{app}(\mathtt{app}(w, \mathtt{cons}(y, \mathtt{flat}(z))), v) &= \\ \mathtt{app}(w, \mathtt{cons}(y, \mathtt{app}(\mathtt{flat}(z), v))) \end{aligned} \qquad (9)$$

which is a generalized/stronger formula than (8). By applying the structural induction formula over lists

$$\left(F[\mathtt{nil}] \land \forall x, y. (F[y] \to F[\mathtt{cons}(x, y)])\right) \to \forall z. F[z]$$

over w in (9), we derive the validity of (9) by also using the definition of app from lines 11-12 in Figure 1. We thus conclude that (1) holds, and hence the assertion at line 17 of Figure 1 is valid.
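As a sanity check of the statements in Example 1 (testing on small inputs, not a proof), the definitions from lines 11–16 of Figure 1 can be transcribed into Python and equations (1) and (9) evaluated on all small trees. The encoding is an assumption of this sketch: naturals are Python integers, lists are tuples, `leaf` is `None`, and a node is a triple; the helper names `trees`, `holds_eq1`, and `holds_eq9` are hypothetical.

```python
LEAF = None                      # bt = leaf | node of bt nat bt

def app(xs, zs):                 # lines 11-12 of Figure 1
    return zs if not xs else (xs[0],) + app(xs[1:], zs)

def flat(t):                     # lines 13-14 of Figure 1
    if t is LEAF:
        return ()
    left, y, right = t
    return app(flat(left), (y,) + flat(right))

def aflat(t, u):                 # lines 15-16 of Figure 1
    if t is LEAF:
        return u
    left, y, right = t
    return aflat(left, (y,) + aflat(right, u))

def trees(depth, labels=(0, 1)):
    """All binary trees of height at most `depth` over the given labels."""
    if depth == 0:
        return [LEAF]
    smaller = trees(depth - 1, labels)
    return [LEAF] + [(l, y, r) for l in smaller for y in labels for r in smaller]

LISTS = [(), (7,), (7, 8)]       # a few sample lists for v and w

def holds_eq1(depth=3):
    """Equation (1): app(flat(u), v) = aflat(u, v)."""
    return all(app(flat(t), v) == aflat(t, v)
               for t in trees(depth) for v in LISTS)

def holds_eq9(depth=2):
    """Equation (9), with flat(x) generalized to an arbitrary list w."""
    return all(app(app(w, (y,) + flat(z)), v) == app(w, (y,) + app(flat(z), v))
               for w in LISTS for y in (0, 1) for z in trees(depth) for v in LISTS)

print(holds_eq1(), holds_eq9())
```

Equation (9) is an instance of the associativity of `app`, which is why induction over the list w suffices once flat(x) has been generalized.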

While the proof above is quite natural for humans, it is very difficult for saturation-based first-order provers using the superposition calculus. For example, the state-of-the-art solvers supporting induction, CVC4 [15], ZIPPERPOSITION [16] and VAMPIRE [17], fail to prove (1). To organize proof search, saturation-based theorem provers, intuitively speaking, disallow rewriting small terms into big terms w.r.t. some ordering. In most (simplification) orderings used by these provers, the terms flat and aflat in (6) cannot be expanded using their recursive definitions, as the right-hand sides of these definitions are heavier/bigger<sup>1</sup> than their left-hand sides. Moreover, deciding the order in which induction hypotheses, such as (4) and (5), should be applied is as difficult as doing the proof itself. In this paper, *we extend superposition reasoning with special treatment of recursive definitions, guiding the generation of induction formulas during saturation (Section IV). We use rewrite rules for terms occurring in recursive definitions and inductive hypotheses (Section V)*. Thanks to this extension, our work can easily validate (1).
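The remark that the right-hand sides of the definitions are heavier than their left-hand sides can be illustrated with a simple symbol count. The sketch below assumes every function symbol and variable has weight 1; actual provers use richer simplification orderings, of which such a weight is at most one ingredient. Terms are encoded as nested tuples, and the name `weight` is hypothetical.

```python
def weight(term):
    """Number of symbol occurrences in a term (variables are plain strings)."""
    if isinstance(term, str):
        return 1
    head, *args = term           # head symbol counts 1, then add the arguments
    return 1 + sum(weight(a) for a in args)

# Line 14 of Figure 1: flat (node x y z) = app (flat x) (cons y (flat z))
flat_lhs = ("flat", ("node", "x", "y", "z"))
flat_rhs = ("app", ("flat", "x"), ("cons", "y", ("flat", "z")))

# Line 16 of Figure 1: aflat (node x y z) u = aflat x (cons y (aflat z u))
aflat_lhs = ("aflat", ("node", "x", "y", "z"), "u")
aflat_rhs = ("aflat", "x", ("cons", "y", ("aflat", "z", "u")))

for name, lhs, rhs in [("flat", flat_lhs, flat_rhs), ("aflat", aflat_lhs, aflat_rhs)]:
    print(name, weight(lhs), weight(rhs), weight(rhs) > weight(lhs))
```

Under this count, unfolding either definition from left to right increases the term size, which is the direction an ordering-constrained prover would normally avoid.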

Another challenging aspect of induction with recursive definitions comes with generalizing and adjusting induction formulas over recursively defined terms and multiple premises, as illustrated next.

Example 2 (Inductive reasoning with naturals). Using the recursive function and predicate definitions of add, mul, and even from lines 4–10 of Figure 1, the assertion at line 18 encodes the following first-order formula over naturals:

$$\forall x, y.\ \mathsf{even}(y) \to \mathsf{even}(\mathsf{mul}(x, y))\tag{10}$$

As in Example 1, proving (10) requires instantiating a structural induction formula for naturals as below:

$$\left(F[\mathtt{zero}] \land \forall z. (F[z] \to F[\mathtt{s}(z)])\right) \to \forall x. F[x] \tag{11}$$

and thereby proving the following two formulas:

$$\forall y.\ \mathsf{even}(y) \rightarrow \mathsf{even}(\mathsf{mul}(\mathsf{zero}, y)) \tag{12}$$

$$\begin{aligned} \forall z, y. \Big( & \big( \mathsf{even}(y) \to \mathsf{even}(\mathsf{mul}(z, y)) \big) \to \\ & \big( \mathsf{even}(y) \to \mathsf{even}(\mathsf{mul}(\mathsf{s}(z), y)) \big) \Big) \end{aligned} \quad (13)$$

Validity of the formula (12) follows from the recursive function definitions in lines 6 and 8 of Figure 1. By using the recursive definition in line 7 of Figure 1, formula (13) reduces to

$$\forall z, y. \left(\mathsf{even}(\mathsf{mul}(z, y)) \rightarrow \mathsf{even}(\mathsf{add}(\mathsf{mul}(z, y), y))\right) \quad (14)$$

The antecedent of (14) cannot, however, be used for proving its conclusion. We overcome this limitation by replacing/generalizing mul(z, y) in (14) with a fresh variable u and instantiating the following variant of (11):

$$\begin{aligned} \big( F[\mathtt{zero}] \land F[\mathtt{s}(\mathtt{zero})] \land \forall z. (F[z] \to F[\mathtt{s}(\mathtt{s}(z))]) \big) \\ \to \forall x. F[x] \end{aligned} \qquad (15)$$

While (11) cannot be used to prove (14), note that (15) enables the application of the recursive definition of even in line 10 of Figure 1. As such, proving the generalized version of (14) reduces to proving the three formulas:

$$\mathsf{even}(\mathsf{zero}) \to \mathsf{even}(\mathsf{add}(\mathsf{zero}, y)) \tag{16}$$

$$\mathtt{even}(\mathtt{s}(\mathtt{zero})) \to \mathtt{even}(\mathtt{add}(\mathtt{s}(\mathtt{zero}), y)) \tag{17}$$

$$\begin{aligned} \forall z. \Big( & \big( \mathsf{even}(z) \to \mathsf{even}(\mathsf{add}(z, y)) \big) \to \\ & \big( \mathsf{even}(\mathsf{s}(\mathsf{s}(z))) \to \mathsf{even}(\mathsf{add}(\mathsf{s}(\mathsf{s}(z)), y)) \big) \Big) \end{aligned} \quad (18)$$

All three formulas can be proven by applying the recursive function definitions of add and even from Figure 1 and using induction with multiple premises over (18) (Section VI). In this paper, *we generate induction formula variants, such as* (15)*, based on recursive function/predicate definitions (Section IV) and support induction with multiple premises (Section VI)*, proving for example (10).
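The obligations of Example 2 can likewise be evaluated on small naturals. The sketch below transcribes lines 4–10 of Figure 1 with Python integers as naturals and checks (10) and the three cases (16)–(18) by bounded enumeration, which is evidence rather than proof. One reading is assumed here and should be treated as such: the hypothesis even(y) from (13) is kept as a side premise when checking (16)–(18), and the last line shows that (16) fails without it. The names `goal`, `implies`, and `needs_premise` are hypothetical.

```python
def add(x, y):                   # lines 4-5 of Figure 1
    return y if x == 0 else 1 + add(x - 1, y)

def mul(x, y):                   # lines 6-7 of Figure 1
    return 0 if x == 0 else add(mul(x - 1, y), y)

def even(x):                     # lines 8-10 of Figure 1
    if x == 0:
        return True
    if x == 1:
        return False
    return even(x - 2)           # even (s (s x)) <-> even x

def implies(a, b):
    return (not a) or b

def goal(u, y):
    """Generalized (14): even(u) -> even(add(u, y)), with mul(z, y) replaced by u."""
    return implies(even(u), even(add(u, y)))

N = range(8)
EVENS = [y for y in N if even(y)]          # side premise even(y), assumed kept

eq10 = all(implies(even(y), even(mul(x, y))) for x in N for y in N)
eq16 = all(goal(0, y) for y in EVENS)                               # base case zero
eq17 = all(goal(1, y) for y in EVENS)                               # base case s(zero)
eq18 = all(implies(goal(z, y), goal(z + 2, y)) for z in N for y in EVENS)
needs_premise = not goal(0, 1)             # (16) is false for odd y

print(eq10, eq16, eq17, eq18, needs_premise)
```

The step case advances by two successors, matching schema (15), so the definition of even at line 10 applies directly to even(s(s(z))).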

While relatively simple, Figure 1 illustrates the key challenges in automating induction with recursive definitions in superposition: (i) *strengthening and creating induction formulas* using recursive definitions (Section IV); (ii) *rewriting recursively defined terms* by their (function/predicate) definitions (Section V); and (iii) *applying induction with multiple premises* (Section VI). In what follows, we describe our solutions for these challenges.

#### III. PRELIMINARIES

We assume familiarity with *standard multi-sorted first-order logic with equality*. Functions are denoted with f, g, h, predicates with p, q, r, variables with x, y, z, u, v, w, and Skolem constants with σ, all possibly with indices. A term is *ground* if it contains no variables. By $\overline{x}$ and $\overline{t}$ we denote tuples of variables and terms, respectively.

We use the standard logical connectives ¬, ∨, ∧, → and ↔, and quantifiers ∀ and ∃. A *literal* is an atom or its negation. For a literal L, we write $\overline{L}$ to denote its complementary literal. A disjunction of literals is a *clause*. We reserve the symbol □ for the *empty clause*, which is logically equivalent to ⊥. We denote the *clausal normal form* of a formula F by cnf(F). We call every term, literal, clause or formula an *expression*. We use the notation s ⊴ t to denote that s is a *subterm* of t, and s ◁ t if s is a *proper subterm* of t.

We use the words *sort* and *type* interchangeably. We distinguish special sorts called *inductive sorts*, function symbols for inductive sorts called *constructors* and *destructors*. We distinguish *recursive constructors*, which have at least one argument of the same sort as their return sort, from *base constructors*, which do not have any arguments of the same type as their return sort. We call the ground terms built from the constructor symbols of a sort its *term algebra*.

We axiomatise term algebras using their *injectivity*, *distinctness*, *exhaustiveness* and *acyclicity* axioms [14]. In this paper, we refer to term algebras also as algebraic data types or inductively defined data types.

We write E[s] to denote that expression E contains k distinguished occurrence(s) of the term s, with k ≥ 0. For simplicity, E[t] means that these occurrences of s are replaced by the term t. Further, E[t]<sub>p<sub>1</sub>...p<sub>k</sub></sub>, with p<sub>1</sub>...p<sub>k</sub> ∈ {0, 1}<sup>k</sup>, is the expression obtained by replacing the ith distinguished occurrence of s by t in E[s] iff p<sub>i</sub> = 1. We abbreviate E[t<sub>1</sub>] ... [t<sub>n</sub>] with $E[\overline{t}]$.

<sup>1</sup>W.r.t. orderings of first-order provers.

A *substitution* θ is a mapping from variables to terms. A substitution θ is a *unifier* of two terms s and t if sθ = tθ, and is a *most general unifier* (*mgu*) if for every unifier η of s and t, there exists a substitution µ such that η = θµ. We denote the mgu of s and t with mgu(s, t).
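The definition of an mgu can be illustrated with a small unification routine. This is a minimal sketch and not the prover's implementation: the term encoding (variables as Python strings, function applications and constants as tuples headed by their symbol) and the names `unify` and `substitute` are assumptions made for illustration only.

```python
def is_var(t):
    # Assumption: variables are plain strings, everything else is a tuple.
    return isinstance(t, str)

def walk(t, subst):
    # Follow variable bindings until an unbound variable or a compound term.
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def occurs(v, t, subst):
    # Occurs check: does variable v appear in t under the current bindings?
    t = walk(t, subst)
    if is_var(t):
        return v == t
    return any(occurs(v, a, subst) for a in t[1:])

def unify(s, t):
    """Return a (triangular) most general unifier of s and t, or None."""
    subst = {}
    stack = [(s, t)]
    while stack:
        a, b = stack.pop()
        a, b = walk(a, subst), walk(b, subst)
        if a == b:
            continue
        if is_var(a):
            if occurs(a, b, subst):
                return None
            subst[a] = b
        elif is_var(b):
            if occurs(b, a, subst):
                return None
            subst[b] = a
        elif a[0] == b[0] and len(a) == len(b):
            stack.extend(zip(a[1:], b[1:]))
        else:
            return None
    return subst

def substitute(t, subst):
    # Apply the substitution exhaustively to the term t.
    t = walk(t, subst)
    if is_var(t):
        return t
    return (t[0],) + tuple(substitute(a, subst) for a in t[1:])

# mgu(flat(v), flat(node(x, y, z))) binds v to node(x, y, z).
theta = unify(("flat", "v"), ("flat", ("node", "x", "y", "z")))
```

Applying `substitute` with the returned bindings to both input terms yields the same term, which is the defining property sθ = tθ of a unifier.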

#### *A. Saturation-based proof search*

First-order theorem provers work with clauses, rather than with arbitrary formulas. Given a set S of input clauses, first-order provers *saturate* S by computing all logical consequences of S with respect to a sound inference system I. The saturated set of S is called the *closure* of S, and the process of computing the closure of S is called *saturation* [22]. If the closure contains the empty clause □, the original set S of clauses is unsatisfiable. A simplified saturation algorithm for inference system I is given below with a clausified goal F and clausified assumptions A as input:

<sup>1</sup> passive := A ∪ {¬F}, active := ∅

<sup>2</sup> **while** passive ≠ ∅:


Completeness and efficiency of saturation-based reasoning rely heavily on properties of select and I (lines 3 and 4). The *superposition calculus* [19] (denoted Sup) is the most common inference system employed by saturation-based first-order theorem provers, such as E [20], VAMPIRE [22] and ZIPPERPOSITION [16]. The superposition calculus is *sound* and *refutationally complete*: for any unsatisfiable formula, the empty clause can be derived as a logical consequence. To organize saturation, first-order provers use simplification *orderings* on terms, which are extended to orderings over literals and clauses; for simplicity, we write ≻ for both the term ordering and its clause ordering extension. We write s ≐ t to mean that the orientation of the equality s = t is fixed (i.e., either s ≻ t or t ≻ s).
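The roles of the passive and active sets and of the selection function can be sketched with a generic given-clause loop. This is not a transcription of the algorithm above: as simplifying assumptions, clauses are propositional (frozensets of non-zero integers, where -n denotes the negation of n), the inference system consists of binary resolution only, and selection picks a smallest passive clause.

```python
def resolvents(c1, c2):
    # All non-tautological binary resolvents of two propositional clauses.
    out = set()
    for lit in c1:
        if -lit in c2:
            r = (c1 - {lit}) | (c2 - {-lit})
            if not any(-l in r for l in r):
                out.add(frozenset(r))
    return out

def select(passive):
    # Assumed selection heuristic: a smallest clause, ties broken by content.
    return min(passive, key=lambda c: (len(c), sorted(c)))

def saturate(assumptions, negated_goal):
    """Return "unsat" if the empty clause is derived, "sat" on saturation."""
    passive = {frozenset(c) for c in assumptions}
    passive |= {frozenset(c) for c in negated_goal}
    active = set()
    while passive:
        given = select(passive)
        passive.remove(given)
        if not given:              # the empty clause was derived
            return "unsat"
        active.add(given)
        for other in list(active):
            for r in resolvents(given, other):
                if r not in active and r not in passive:
                    passive.add(r)
    return "sat"

# Assumptions p and (not p or q), goal q: the negated goal is (not q).
result = saturate([[1], [-1, 2]], [[-2]])
```

On this input the loop derives the empty clause and reports unsatisfiability, that is, the goal follows from the assumptions.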

We make use of the following inference rules of Sup in this paper:

#### Binary resolution:

$$\frac{A \lor C \quad \neg B \lor D}{(C \lor D)\theta}$$

where θ is the mgu of A and B.

#### Superposition:

$$\frac{l = r \lor C \quad s[l'] \neq t \lor D}{(s[r] \neq t \lor C \lor D)\theta} \quad \frac{l = r \lor C \quad s[l'] = t \lor D}{(s[r] = t \lor C \lor D)\theta}$$

where θ is the mgu of l and l′, rθ ⋡ lθ and tθ ⋡ s[l′]θ. There are special cases of these rules, imposing more restrictions on the premises. One such case is when one of the premises of superposition is a unit clause, yielding the so-called *demodulation* rules, as given in Section V.

Given an ordering ≻, a clause C is *redundant* with respect to a set S of clauses if there exists a subset S′ of S such that S′ is smaller than {C} (i.e., C ≻ S′) and S′ implies C. Redundant clauses can be eliminated during proof search without destroying completeness; *simplification and deletion rules* are used to remove redundant clauses.

# IV. INDUCTION FORMULAS OVER RECURSIVE DEFINITIONS IN SUPERPOSITION

We now describe our solution for generating induction formulas in saturation-based theorem proving. Unlike [7], [4], [16], [10], [11], [25], [26], we integrate induction directly into saturation-based theorem proving using the superposition calculus. For doing so, we rely on [17], [18] and use the following sound *inference rule of induction*:

$$\frac{\overline{L}[\overline{t}] \lor C}{\mathtt{cnf}(F \to \forall \overline{y}. L[\overline{y}])} \text{ (Ind)},$$

where L is a ground literal, C is a clause, and $F \to \forall \overline{y}. L[\overline{y}]$ is a valid induction formula. Further, $\overline{y}$ is a tuple of variables and $\overline{t}$ is a tuple of induction terms, of the same size.

In [17], [18], the inference rule (Ind) has been used by considering the induction formulas as instances of mathematical and structural induction. In this paper, we go beyond these works and utilise recursive function/predicate definitions to derive induction formulas to be used in (Ind). For doing so, we first select terms in recursive definitions over which induction formulas will be generated in Section IV-A and strengthened in Section IV-B. Further, in Section VI we extend (Ind) to induction formulas with multiple premises.

#### *A. Generating Induction Formulas over Recursive Definitions*

A recursive function/predicate definition has a number of branches, characterized by one or more clauses. We assume that (i) a function definition clause contains exactly one equality with a fixed orientation, i.e., $\mathtt{f}(\overline{s}) \doteq t \lor C$. Similarly, (ii) a predicate definition axiom contains one marked literal, i.e., $(\neg)\hat{\mathtt{p}}(\overline{s}) \lor D$, where $\hat{\mathtt{p}}$ denotes that p is marked/selected. Two clauses $\mathtt{f}(\overline{s_1}) \doteq t_1 \lor C$ and $\mathtt{f}(\overline{s_2}) \doteq t_2 \lor D$ belong to the same branch of f if $\mathtt{f}(\overline{s_1})$ and $\mathtt{f}(\overline{s_2})$ are variants of each other. Similarly, two clauses $(\neg)\hat{\mathtt{p}}(\overline{s_1}) \lor C$ and $(\neg)\hat{\mathtt{p}}(\overline{s_2}) \lor D$ belong to the same branch of p if $\mathtt{p}(\overline{s_1})$ and $\mathtt{p}(\overline{s_2})$ are variants of each other. We therefore characterize a recursive definition branch with its *characteristic term* $\mathtt{f}(\overline{s})$ or *characteristic atom* $\mathtt{p}(\overline{s})$. We write "branch $\mathtt{f}(\overline{s})$" and "branch $\mathtt{p}(\overline{s})$" to refer to the branches with the characteristic term $\mathtt{f}(\overline{s})$ and characteristic atom $\mathtt{p}(\overline{s})$, respectively. We denote the set of variable disjoint branches of a function f and predicate p with $\mathcal{B}_{\mathtt{f}}$ and $\mathcal{B}_{\mathtt{p}}$, respectively.

Definition 1 (Recursive Calls of Recursive Definitions). Let f be a recursive function and p a recursive predicate. The *sets of recursive calls* corresponding, respectively, to the branch $\mathtt{f}(\overline{s})$ and the branch $\mathtt{p}(\overline{s})$ are defined as:

$$\begin{aligned} \mathcal{R}_{\mathtt{f}(\overline{s})} &:= \bigcup_{\mathtt{f}(\overline{s'}) \doteq t \lor C} \{ \mathtt{f}(\overline{s''})\theta \mid \mathtt{f}(\overline{s''}) \trianglelefteq t,\ \mathtt{f}(\overline{s'})\theta = \mathtt{f}(\overline{s}) \} \\ \mathcal{R}_{\mathtt{p}(\overline{s})} &:= \bigcup_{\hat{\mathtt{p}}(\overline{s'}) \lor C} \{ \mathtt{p}(\overline{s''})\theta \mid \mathtt{p}(\overline{s''}) \in C,\ \mathtt{p}(\overline{s'})\theta = \mathtt{p}(\overline{s}) \} \end{aligned}$$

The rest of this section only details the generation of induction formulas using recursive function definitions; recursive predicates are handled similarly. Given a recursive function f, we categorize its argument positions similarly to [16].

Definition 2 (Active Positions, Accumulators). If for any branch $\mathtt{f}(\overline{s}) \in \mathcal{B}_{\mathtt{f}}$ and $\mathtt{f}(\overline{s'}) \in \mathcal{R}_{\mathtt{f}(\overline{s})}$:


We denote the set of active and accumulator argument positions of f with $\mathcal{I}_{\mathtt{f}}$.

Example 3. Based on the functions app, flat and aflat from Figure 1 lines 11-16, we have:

$$\mathcal{B}_{\mathtt{app}} = \{\mathtt{app}(\mathtt{nil}, z_0),\ \mathtt{app}(\mathtt{cons}(x, y), z_1)\}$$

$$\mathcal{B}_{\mathtt{flat}} = \{\mathtt{flat}(\mathtt{leaf}),\ \mathtt{flat}(\mathtt{node}(x, y, z))\}$$

$$\mathcal{B}_{\mathtt{aflat}} = \{\mathtt{aflat}(\mathtt{leaf}, u_0),\ \mathtt{aflat}(\mathtt{node}(x, y, z), u_1)\}$$

While $\mathcal{R}_{\mathtt{app}(\mathtt{nil}, z_0)} = \mathcal{R}_{\mathtt{flat}(\mathtt{leaf})} = \mathcal{R}_{\mathtt{aflat}(\mathtt{leaf}, u_0)} = \emptyset$, the second branches of the three functions have the following sets of recursive calls:

$$\begin{aligned} \mathcal{R}_{\mathtt{app}(\mathtt{cons}(x,y),z_1)} &= \{\mathtt{app}(y,z_1)\} \\ \mathcal{R}_{\mathtt{flat}(\mathtt{node}(x,y,z))} &= \{\mathtt{flat}(x),\ \mathtt{flat}(z)\} \\ \mathcal{R}_{\mathtt{aflat}(\mathtt{node}(x,y,z),u_1)} &= \{\mathtt{aflat}(x,\mathtt{cons}(y,\mathtt{aflat}(z,u_1))),\ \mathtt{aflat}(z,u_1)\} \end{aligned}$$

$\mathcal{I}_{\mathtt{app}} = \{1\}$, since y is a proper subterm of cons(x, y), but the second argument is not an accumulator since it remains z<sub>1</sub> in the only recursive call. The only argument position of flat is active, and therefore $\mathcal{I}_{\mathtt{flat}} = \{1\}$. Finally, aflat has one active and one accumulator argument position, hence $\mathcal{I}_{\mathtt{aflat}} = \{1, 2\}$.
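The sets of recursive calls in Example 3 can be recomputed mechanically for unit definitions. The sketch below is illustrative only: the defining equations of app, flat and aflat are assumed (Figure 1 is not reproduced in this section; they are chosen to be consistent with the recursive calls listed above), terms are encoded as tuples headed by their symbol with variables as strings, and `recursive_calls` is a hypothetical helper name.

```python
def subterms(t):
    # Enumerate all subterms of t in pre-order (t itself comes first).
    yield t
    if not isinstance(t, str):
        for arg in t[1:]:
            yield from subterms(arg)

def recursive_calls(lhs, rhs):
    # Subterms of the right-hand side headed by the defined function symbol.
    f = lhs[0]
    return [s for s in subterms(rhs) if not isinstance(s, str) and s[0] == f]

NIL, LEAF = ("nil",), ("leaf",)
NODE = ("node", "x", "y", "z")

# Assumed definitions, one (lhs, rhs) pair per branch.
DEFS = {
    "app": [
        (("app", NIL, "z0"), "z0"),
        (("app", ("cons", "x", "y"), "z1"), ("cons", "x", ("app", "y", "z1"))),
    ],
    "flat": [
        (("flat", LEAF), NIL),
        (("flat", NODE),
         ("app", ("flat", "x"), ("cons", "y", ("flat", "z")))),
    ],
    "aflat": [
        (("aflat", LEAF, "u0"), "u0"),
        (("aflat", NODE, "u1"),
         ("aflat", "x", ("cons", "y", ("aflat", "z", "u1")))),
    ],
}

calls = {f: [recursive_calls(l, r) for l, r in branches]
         for f, branches in DEFS.items()}
```

Each base branch yields an empty list, and the second branches yield the calls listed in Example 3; comparing the arguments of each call with those of its branch is what separates the active first position from the accumulator position of aflat.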

Definition 3 (Induction Terms from Active and Accumulator Positions). Consider a recursive function f of arity n and a ground term $\mathtt{f}(\overline{c})$. The term $\mathtt{f}(\overline{c'})$ is a *generator term* iff (i) $\overline{c'}$ coincides with $\overline{c}$ in all positions from $\{1 \le i \le n\} \setminus \mathcal{I}_{\mathtt{f}}$, and (ii) $\overline{c'}$ contains fresh variables on positions from $\mathcal{I}_{\mathtt{f}}$.

The *induction case* of $\mathtt{f}(\overline{c})$ over branch $\mathtt{f}(\overline{s}) \in \mathcal{B}_{\mathtt{f}}$ is the two-tuple:

$$(\theta, \{\mathsf{mgu}(\mathtt{f}(\overline{c'}), \mathtt{f}(\overline{s'})\theta) \mid \mathtt{f}(\overline{s'}) \in \mathcal{R}_{\mathtt{f}(\overline{s})}\})$$

where $\theta := \mathsf{mgu}(\mathtt{f}(\overline{c'}), \mathtt{f}(\overline{s}))$.

The *case distinction* $\Theta_{\mathtt{f}(\overline{c})}$ of $\mathtt{f}(\overline{c})$ is the set of induction cases of $\mathtt{f}(\overline{c})$ over each branch of f. We call $\{c_i \mid i \in \mathcal{I}_{\mathtt{f}}\}$ the *induction terms* of $\mathtt{f}(\overline{c})$.

Induction Formula over Active and Accumulator Terms. Using Definition 3, we guide induction formula generation over active and accumulator terms, as follows. Given a literal $L[\overline{c}]$ with zero or more occurrences of the terms $\overline{c}$, we generate and add the following *induction formula over active and accumulator terms* to saturation-based proving:

$$(\forall) \bigwedge_{(\theta, R) \in \Theta_{\mathtt{f}(\overline{c})}} \Big( \bigwedge_{\theta' \in R} L[\overline{c'}]\theta' \to L[\overline{c'}]\theta \Big) \to L[\overline{c'}] \tag{19}$$

Since (19) is a valid induction formula, using it in the conclusion of (Ind) yields a sound (Ind) inference.

Example 4. For proving the assertion of line 17 from Figure 1 in a saturation-based framework, we consider its negation:

$$\mathtt{app}(\mathtt{flat}(\sigma_0), \sigma_1) \neq \mathtt{aflat}(\sigma_0, \sigma_1) \tag{20}$$

Using Definition 3 and $\mathcal{I}_{\mathtt{flat}}$ (Example 3), the generator term of flat(σ<sub>0</sub>) is t := flat(v). Moreover, by $\mathcal{B}_{\mathtt{flat}}$ from Example 3, we obtain

$$\begin{aligned} \theta_1 &= \mathsf{mgu}(t, \mathtt{flat}(\mathtt{leaf})) = \{v \mapsto \mathtt{leaf}\} \\ \theta_2 &= \mathsf{mgu}(t, \mathtt{flat}(\mathtt{node}(x, y, z))) = \{v \mapsto \mathtt{node}(x, y, z)\} \end{aligned}$$

Applying the unifier θ<sub>2</sub> on the recursive calls of $\mathcal{R}_{\mathtt{flat}(\mathtt{node}(x,y,z))}$ from Example 3 is a no-op, since the recursive calls do not contain v, and we derive

$$\begin{aligned} \theta_{2.1} &= \mathsf{mgu}(t, \mathtt{flat}(x)) = \{v \mapsto x\} \\ \theta_{2.2} &= \mathsf{mgu}(t, \mathtt{flat}(z)) = \{v \mapsto z\} \end{aligned}$$

Using the case distinction

$$\Theta_{\mathtt{flat}(\sigma_0)} = \{ (\theta_1, \emptyset),\ (\theta_2, \{\theta_{2.1}, \theta_{2.2}\}) \} \tag{21}$$

we derive the following induction formula:

$$\begin{aligned} & \forall x, y, z, u. \\ & \Big( \big( \mathtt{app}(\mathtt{flat}(\mathtt{leaf}), \sigma_1) = \mathtt{aflat}(\mathtt{leaf}, \sigma_1) \big) \land \\ & \quad \big( ( \mathtt{app}(\mathtt{flat}(x), \sigma_1) = \mathtt{aflat}(x, \sigma_1) \land \\ & \qquad \mathtt{app}(\mathtt{flat}(z), \sigma_1) = \mathtt{aflat}(z, \sigma_1) ) \rightarrow \\ & \qquad \mathtt{app}(\mathtt{flat}(\mathtt{node}(x, y, z)), \sigma_1) = \mathtt{aflat}(\mathtt{node}(x, y, z), \sigma_1) \big) \Big) \\ & \rightarrow \mathtt{app}(\mathtt{flat}(u), \sigma_1) = \mathtt{aflat}(u, \sigma_1) \end{aligned} \tag{22}$$

#### *B. Strengthening Induction over Recursive Definitions*

Induction hypotheses of induction formulas might not be strong enough to prove the corresponding induction step. A common technique to overcome such limitations is to strengthen the induction hypotheses: replace some terms in the hypotheses with universally quantified fresh variables, thus yielding logically stronger versions of the induction hypotheses. Introducing universally quantified variables during saturation can however negatively impact the performance of the prover (e.g., yielding more unifications/rewriting steps). As a remedy to this practical burden in the context of recursive function definitions f, we utilize the *accumulator argument positions* from $\mathcal{I}_{\mathtt{f}}$ in Definition 3, which remove the need for introducing universally quantified variables by implicitly instantiating these variables to the terms that will be matched by the recursive calls of f.

Example 5. The induction formula (22) is not strong enough to prove (20), and strengthening its induction hypotheses by replacing σ<sub>1</sub> with a universally quantified fresh variable, as in (4) and (5) from Example 1, is inefficient. Instead, we use the term aflat(σ<sub>0</sub>, σ<sub>1</sub>) from (20) with the generator term t′ := aflat(v, w) and induction terms {σ<sub>0</sub>, σ<sub>1</sub>}. We obtain the following unifiers:

$$\begin{aligned} \theta_1' &= \mathsf{mgu}(t', \mathtt{aflat}(\mathtt{leaf}, u_0)) = \{v \mapsto \mathtt{leaf}, w \mapsto u_0\} \\ \theta_2' &= \mathsf{mgu}(t', \mathtt{aflat}(\mathtt{node}(x, y, z), u_1)) \\ &= \{v \mapsto \mathtt{node}(x, y, z), w \mapsto u_1\} \end{aligned}$$

Applying θ′<sub>2</sub> is once again a no-op on the recursive calls $\mathcal{R}_{\mathtt{aflat}(\mathtt{node}(x,y,z),u_1)}$, and we get the unifiers:

$$\begin{aligned} \theta_{2.1}' &= \mathsf{mgu}(t', \mathtt{aflat}(x, \mathtt{cons}(y, \mathtt{aflat}(z, u_1)))) \\ &= \{v \mapsto x, w \mapsto \mathtt{cons}(y, \mathtt{aflat}(z, u_1))\} \\ \theta_{2.2}' &= \mathsf{mgu}(t', \mathtt{aflat}(z, u_1)) = \{v \mapsto z, w \mapsto u_1\} \end{aligned}$$

Thus we obtain the induction formula with the required induction hypothesis with term cons(y, aflat(z, u<sub>1</sub>)) that matches the conclusion after simplification:

$$\begin{aligned} & \forall x, y, z, u_0, u_1, v, w. \\ & \Big( \big( \mathtt{app}(\mathtt{flat}(\mathtt{leaf}), u_0) = \mathtt{aflat}(\mathtt{leaf}, u_0) \big) \land \\ & \quad \big( ( \mathtt{app}(\mathtt{flat}(x), \mathtt{cons}(y, \mathtt{aflat}(z, u_1))) = \\ & \qquad\quad \mathtt{aflat}(x, \mathtt{cons}(y, \mathtt{aflat}(z, u_1))) \land \\ & \qquad \mathtt{app}(\mathtt{flat}(z), u_1) = \mathtt{aflat}(z, u_1) ) \rightarrow \\ & \qquad \mathtt{app}(\mathtt{flat}(\mathtt{node}(x, y, z)), u_1) = \mathtt{aflat}(\mathtt{node}(x, y, z), u_1) \big) \Big) \\ & \rightarrow \mathtt{app}(\mathtt{flat}(v), w) = \mathtt{aflat}(v, w) \end{aligned} \tag{23}$$

After skolemizing x, y, z, u<sub>0</sub> and u<sub>1</sub> during clausification and binary resolving with (20), with v and w bound to σ<sub>0</sub> and σ<sub>1</sub>, respectively, we get the following ground induction hypothesis literals and ground conclusion literal from (23):

$$\begin{aligned} \mathtt{app}(\mathtt{flat}(\sigma\_2), \mathtt{cons}(\sigma\_3, \mathtt{aflat}(\sigma\_4, \sigma\_5))) &= \\ \mathtt{aflat}(\sigma\_2, \mathtt{cons}(\sigma\_3, \mathtt{aflat}(\sigma\_4, \sigma\_5))) \end{aligned} \quad (24)$$

$$\mathtt{app}(\mathtt{flat}(\sigma_4), \sigma_5) = \mathtt{aflat}(\sigma_4, \sigma_5) \tag{25}$$

$$\begin{aligned} \mathtt{app}(\mathtt{flat}(\mathtt{node}(\sigma\_2, \sigma\_3, \sigma\_4)), \sigma\_5) \neq \\ \mathtt{aflat}(\mathtt{node}(\sigma\_2, \sigma\_3, \sigma\_4), \sigma\_5) \end{aligned} \qquad (26)$$

Further, the hypotheses of (23) are strong enough to prove (20), as shown in Section V.
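The difference between the induction formulas (22) and (23) lies only in the case distinction that is plugged into (19). The following sketch assembles both formulas as strings from hand-written case distinctions; it is a simplified illustration in which substitutions are represented directly by the argument pairs they produce, and the names `literal` and `induction_formula` are hypothetical.

```python
def literal(v, w):
    # The literal app(flat(v), w) = aflat(v, w) for given argument strings.
    return f"app(flat({v}), {w}) = aflat({v}, {w})"

def induction_formula(cases, goal):
    # cases: list of (conclusion_args, [hypothesis_args, ...]) pairs,
    # one per branch of the recursive definition, following the shape of (19).
    premises = []
    for concl, hyps in cases:
        step = literal(*concl)
        if hyps:
            hyp_text = " ∧ ".join(literal(*h) for h in hyps)
            step = f"({hyp_text}) → {step}"
        premises.append(f"({step})")
    return "(" + " ∧ ".join(premises) + ") → " + literal(*goal)

# Case distinction over flat(σ0): only the first argument varies, as in (22).
weak = induction_formula(
    [(("leaf", "σ1"), []),
     (("node(x, y, z)", "σ1"), [("x", "σ1"), ("z", "σ1")])],
    goal=("u", "σ1"),
)

# Case distinction over aflat(σ0, σ1): the accumulator varies too, as in (23).
strong = induction_formula(
    [(("leaf", "u0"), []),
     (("node(x, y, z)", "u1"),
      [("x", "cons(y, aflat(z, u1))"), ("z", "u1")])],
    goal=("v", "w"),
)

print(weak)
print(strong)
```

Only the second formula contains a hypothesis whose accumulator argument is cons(y, aflat(z, u1)), which is the instance needed in the step case of Example 5.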

In summary, we use Definition 3 to generate induction formulas over the active and accumulator terms from $\mathcal{I}_{\mathtt{f}}$. To further limit and guide the generation of induction formulas, we devised *heuristics* similar to [16]. Foremost, we only generate induction formulas from function/predicate terms with active occurrences.

Definition 4 (Active Term Occurrences). An occurrence of a term t in literal L is an *active occurrence* if (i) t is L, or (ii) L is an equality l = r and t is l or r, or (iii) the immediate superterm s of t is an active occurrence and the occurrence of t is in an active argument position of s.
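Definition 4 can be read as a top-down traversal that only descends through active argument positions. The sketch below applies this reading to the literal of (20), ignoring polarity; the tuple encoding of terms, the representation of an equality as a tuple headed by "=", and the table `ACTIVE` (taken from Example 3, where position 2 of aflat is an accumulator and therefore not listed) are assumptions for illustration.

```python
# Active argument positions per function symbol (1-based), from Example 3.
ACTIVE = {"app": {1}, "flat": {1}, "aflat": {1}}

def active_occurrences(literal):
    """Return (path, term) pairs for all active occurrences in the literal."""
    found = []

    def visit(term, path):
        found.append((path, term))
        if isinstance(term, str):        # variables have no arguments
            return
        for i, arg in enumerate(term[1:], start=1):
            # Case (iii): descend only through active argument positions.
            if i in ACTIVE.get(term[0], set()):
                visit(arg, path + (i,))

    if literal[0] == "=":
        # Case (ii): both sides of an equality are active occurrences.
        visit(literal[1], (1,))
        visit(literal[2], (2,))
    else:
        # Case (i): the atom itself is an active occurrence.
        visit(literal, ())
    return found

S0, S1 = ("σ0",), ("σ1",)
lit20 = ("=", ("app", ("flat", S0), S1), ("aflat", S0, S1))
occurrences = active_occurrences(lit20)
active_terms = [term for _, term in occurrences]
```

For this literal, σ0 has two active occurrences (below flat and below aflat), while σ1 has none, since it only appears in a non-active position of app and in the accumulator position of aflat.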

As described in [18], apart from generalizing over complex terms as seen in Example 1, we can also generalize over active term occurrences. For example, we can refine the induction formula (19) to induct upon only certain occurrences of an induction term t with k occurrences in literal L, by using any bit vector p ∈ {0, 1}<sup>k</sup> and L[t]<sub>p</sub> instead of L[t].
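Selecting occurrences with a bit vector amounts to a left-to-right traversal that consumes one bit per occurrence. The following sketch illustrates this on a made-up literal; the tuple encoding of terms and the helper names `count_occurrences` and `replace_selected` are assumptions and do not come from the implementation.

```python
def count_occurrences(expr, s):
    # Number of occurrences of the term s in expr.
    if expr == s:
        return 1
    if isinstance(expr, str):
        return 0
    return sum(count_occurrences(arg, s) for arg in expr[1:])

def replace_selected(expr, s, t, bits):
    """Replace the i-th occurrence of s in expr by t iff bits[i] == 1."""
    if len(bits) != count_occurrences(expr, s):
        raise ValueError("need exactly one bit per occurrence of s")
    remaining = iter(bits)

    def go(e):
        if e == s:
            return t if next(remaining) == 1 else e
        if isinstance(e, str):
            return e
        return (e[0],) + tuple(go(arg) for arg in e[1:])

    return go(expr)

S0 = ("σ0",)
TWO = ("s", ("s", ("zero",)))
# Hypothetical literal add(σ0, σ0) = mul(s(s(zero)), σ0), three occurrences.
lit = ("=", ("add", S0, S0), ("mul", TWO, S0))
generalized = replace_selected(lit, S0, "x", (1, 0, 1))
```

With the bit vector (1, 0, 1), the first and third occurrences of σ0 are replaced by the variable x and the second is kept.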

# V. REFUTING INDUCTIVE PROPERTIES WITH RECURSIVE DEFINITIONS

Automating inductive reasoning not only requires finding useful induction formulas, but also comes with the task of proving inductive properties. Section IV detailed our approach towards finding useful induction formulas over recursive definitions. As a next step, we now present our solution towards (more) efficient refutation of inductive properties over recursive definitions.

#### *A. Rewriting with Recursive Function Definitions*

We extend superposition reasoning with two inference rules in support of rewriting recursive functions by their definitions.

First, we focus on a *simplification inference* implementing rewriting by unit equalities, called also demodulation [22]. We adjust demodulation to handle unit clauses describing recursive function definitions, as follows:

$$\frac{f(\overline{s}) \doteq t \quad L[f(\overline{s})\theta] \lor D}{L[t\theta] \lor D} \text{ (DemF)}$$

where $f(\overline{s})\theta \succ t\theta$ and $L[f(\overline{s})\theta] \lor D \succ f(\overline{s})\theta = t\theta$.

Second, we introduce a *generating inference* rule as an instance of superposition rules. Namely, we enable rewriting arbitrary recursive functions with their definitions, as follows:

$$\frac{f(\overline{s}) \doteq t \lor C \quad L[f(\overline{s})\theta] \lor D}{L[t\theta] \lor C\theta \lor D} \text{ (ParF)}$$

Note that (ParF) has no side conditions restricting which terms can be rewritten. As such, (ParF) allows expanding function headers, yet at the cost that small terms may be rewritten into bigger terms w.r.t. the underlying term ordering of a superposition prover. As a result, the simplification ordering constraints of ≻ are violated by (ParF), yielding an incomplete extension of superposition. On the other hand, soundness of superposition implies soundness of our new inference rules.

Theorem 1 (Soundness of Rewriting). The inference rules (DemF) and (ParF) are sound.
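Rewriting with a unit function definition, without any ordering check as in (ParF), can be sketched by matching the left-hand side of the definition against a subterm and instantiating the right-hand side. The code below is an illustration under assumptions: terms are tuples headed by their symbol with variables as strings, the two definitions of flat and aflat are assumed to correspond to Figure 1 (which is not reproduced in this section), and all helper names are hypothetical.

```python
def match(pattern, term, subst):
    # One-sided matching: bind variables of pattern so that it equals term.
    if isinstance(pattern, str):
        if pattern in subst:
            return subst if subst[pattern] == term else None
        return {**subst, pattern: term}
    if isinstance(term, str) or pattern[0] != term[0] or len(pattern) != len(term):
        return None
    for p, t in zip(pattern[1:], term[1:]):
        subst = match(p, t, subst)
        if subst is None:
            return None
    return subst

def instantiate(term, subst):
    if isinstance(term, str):
        return subst.get(term, term)
    return (term[0],) + tuple(instantiate(a, subst) for a in term[1:])

def rewrite_once(term, lhs, rhs):
    """Rewrite the leftmost-outermost instance of lhs in term, else None."""
    theta = match(lhs, term, {})
    if theta is not None:
        return instantiate(rhs, theta)
    if isinstance(term, str):
        return None
    for i, arg in enumerate(term[1:], start=1):
        new = rewrite_once(arg, lhs, rhs)
        if new is not None:
            return term[:i] + (new,) + term[i + 1:]
    return None

NODE = ("node", "x", "y", "z")
# Assumed definitions of the recursive branches of flat and aflat.
FLAT = (("flat", NODE), ("app", ("flat", "x"), ("cons", "y", ("flat", "z"))))
AFLAT = (("aflat", NODE, "u"), ("aflat", "x", ("cons", "y", ("aflat", "z", "u"))))

S2, S3, S4, S5 = ("σ2",), ("σ3",), ("σ4",), ("σ5",)
GROUND_NODE = ("node", S2, S3, S4)
# The two sides of the conclusion literal (26).
left = rewrite_once(("app", ("flat", GROUND_NODE), S5), *FLAT)
right = rewrite_once(("aflat", GROUND_NODE, S5), *AFLAT)
```

Under these assumed definitions, `left` and `right` are the two sides of the disequality obtained from (26) by expanding flat and aflat once.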

#### *B. Rewriting Induction Hypotheses*

Upon clausifying the induction formula (19) introduced in Section IV, for each step case ⋀<sub>1≤i≤m</sub> L[t<sub>i</sub>] → L[t] we obtain a set of *induction hypothesis literals* L[t′<sub>i</sub>] and an *induction conclusion literal* L[t′]. Intuitively, we extend these notions such that any literal resulting from the rewriting or simplification of induction hypothesis or induction conclusion literals is also an induction hypothesis or induction conclusion literal, respectively.

We introduce an *induction hypothesis rewriting rule*, in short (IndHRW), to (i) rewrite one side of an induction conclusion literal with one of its induction hypothesis literals (against ordering constraints) and (ii) apply induction on the rewritten induction conclusion literal without adding it to the search space:

$$\frac{l = r \lor D \quad s[l] \neq t \lor C}{\mathtt{cnf}(F \to \forall \overline{y}. (s[r] = t)[\overline{y}])} \text{ (IndHRW)}$$

where s ≠ t is an induction conclusion literal with corresponding induction hypothesis literal l = r, l ⊁ r, and $F \to \forall \overline{y}.(s[r] = t)[\overline{y}]$ is a valid induction formula. By soundness of (Ind), we conclude soundness of (IndHRW).

Theorem 2 (Soundness of Induction Hypothesis Rewriting). The inference rule (IndHRW) is sound.

Note that (IndHRW) allows rewriting only with induction hypothesis literals that are positive equalities. Hence, the induction conclusion literal must be a disequality (s ≠ t). We further stress that rewriting using the premises of (IndHRW) yields s[r] ≠ t ∨ C ∨ D, which is binary resolved against the resulting induction formula clauses of (19) and not added to the search space.

Example 6. Continuing Example 5, rewriting (26) with (ParF) results in a new induction conclusion literal:

$$\begin{aligned} \mathtt{app}(\mathtt{app}(\mathtt{flat}(\sigma_2), \mathtt{cons}(\sigma_3, \mathtt{flat}(\sigma_4))), \sigma_5) &\neq \\ \mathtt{aflat}(\sigma_2, \mathtt{cons}(\sigma_3, \mathtt{aflat}(\sigma_4, \sigma_5))) \end{aligned} \tag{27}$$

By rewriting the right-hand side of (27) with the corresponding hypothesis literals (24) and (25), we obtain the intermediate induction conclusion literal

$$\begin{aligned} \mathtt{app}(\mathtt{app}(\mathtt{flat}(\sigma_2), \mathtt{cons}(\sigma_3, \mathtt{flat}(\sigma_4))), \sigma_5) &\neq \\ \mathtt{app}(\mathtt{flat}(\sigma_2), \mathtt{cons}(\sigma_3, \mathtt{app}(\mathtt{flat}(\sigma_4), \sigma_5))) \end{aligned} \tag{28}$$

By applying induction with (IndHRW) with case distinction $\Theta_{\mathtt{app}(\mathtt{flat}(\sigma_2), \mathtt{cons}(\sigma_3, \mathtt{flat}(\sigma_4)))}$ and induction term flat(σ<sub>2</sub>), we obtain the induction formula:

$$\begin{aligned} & \forall x, y, z. \\ & \Big( \big( \mathtt{app}(\mathtt{app}(\mathtt{nil}, \mathtt{cons}(\sigma_3, \mathtt{flat}(\sigma_4))), \sigma_5) = \\ & \qquad \mathtt{app}(\mathtt{nil}, \mathtt{cons}(\sigma_3, \mathtt{app}(\mathtt{flat}(\sigma_4), \sigma_5))) \big) \land \\ & \quad \big( \mathtt{app}(\mathtt{app}(y, \mathtt{cons}(\sigma_3, \mathtt{flat}(\sigma_4))), \sigma_5) = \\ & \qquad \mathtt{app}(y, \mathtt{cons}(\sigma_3, \mathtt{app}(\mathtt{flat}(\sigma_4), \sigma_5))) \to \\ & \qquad \mathtt{app}(\mathtt{app}(\mathtt{cons}(x, y), \mathtt{cons}(\sigma_3, \mathtt{flat}(\sigma_4))), \sigma_5) = \\ & \qquad \mathtt{app}(\mathtt{cons}(x, y), \mathtt{cons}(\sigma_3, \mathtt{app}(\mathtt{flat}(\sigma_4), \sigma_5))) \big) \Big) \\ & \to \mathtt{app}(\mathtt{app}(z, \mathtt{cons}(\sigma_3, \mathtt{flat}(\sigma_4))), \sigma_5) = \\ & \qquad \mathtt{app}(z, \mathtt{cons}(\sigma_3, \mathtt{app}(\mathtt{flat}(\sigma_4), \sigma_5))) \end{aligned} \tag{29}$$

The resulting clauses – after binary resolving with the intermediate unit clause (28) – can be finally refuted using the definitions at lines 11 and 12 of Figure 1. We thus validate correctness of the assertion on line 17 in Figure 1.

#### VI. MULTI-CLAUSE INDUCTION IN SUPERPOSITION

The induction rule (Ind) does not allow inducting on multiple literals, limiting for example the use of (Ind) over (14) in Example 2. Moreover, when (Ind) is used together with the induction formula (19), clausification introduces new Skolem constants, making it impossible to use ground assumptions or previous induction hypotheses containing different ground subterms. To address this issue, in this section we revise the induction inference rule (Ind) with only one premise to an *induction rule with multiple premises*, as follows.

We extend (Ind) for a given literal L (the *main literal*) to also incorporate other literals L<sub>i</sub> (the *side literals*) that are relevant for proving L, as follows:

$$\frac{L_1[\overline{t}] \lor C_1 \quad \dots \quad L_n[\overline{t}] \lor C_n \quad \overline{L}[\overline{t}] \lor C}{\mathtt{cnf}(F \to \forall \overline{y}.(\bigwedge_{1 \le i \le n} L_i[\overline{y}] \to L[\overline{y}]))} \text{ (IndMC)},$$

where L and L<sub>i</sub> are ground literals, C and C<sub>i</sub> are clauses, and $F \to \forall \overline{y}.(\bigwedge_{1 \le i \le n} L_i[\overline{y}] \to L[\overline{y}])$ is a valid induction formula. Further, $\overline{y}$ and $\overline{t}$ are tuples of variables and induction terms, respectively. Soundness of (IndMC) then follows from soundness of (Ind).

Theorem 3 (Soundness of Multi-clause Induction). The rule (IndMC) is sound.

We note that after the application of (IndMC), binary resolution can be applied on each resulting clause with the main and side literals, yielding $\mathtt{cnf}(\neg F) \lor \bigvee_{1 \le i \le n} C_i \lor C$.

Multi-Clause Induction Formula over Active and Accumulator Terms. For generating valid induction formulas to be used in (IndMC), we proceed as in Section IV. Yet, we adjust the generation of (19) by using Definition 3 over the active and accumulator terms of $\bigwedge_{k=1}^{n} L_k[\overline{c'}] \to L[\overline{c'}]$ (rather than just $L[\overline{c}]$). As a result, for a given case distinction $\Theta_{\mathtt{f}(\overline{c})}$, we generate the following *multi-clause induction formula over active and accumulator terms* in saturation-based proving:

$$\begin{aligned} (\forall) & \bigwedge_{(\theta,R)\in\Theta_{\mathtt{f}(\overline{c})}} \Big( \bigwedge_{\theta'\in R} \big(\wedge_{k=1}^n L_k[\overline{c'}]\theta' \to L[\overline{c'}]\theta'\big) \to \\ & \big(\wedge_{k=1}^n L_k[\overline{c'}]\theta \to L[\overline{c'}]\theta\big) \Big) \to \big(\wedge_{k=1}^n L_k[\overline{c'}] \to L[\overline{c'}]\big) \end{aligned} \tag{30}$$

Since (30) is a valid induction formula, using it in the conclusion of (IndMC) yields a sound (IndMC) inference.
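The shape of (30) differs from (19) only in that every instance of the main literal is guarded by the corresponding instances of the side literals. The sketch below assembles such a formula as a string for one side literal even(·) and the main literal even(add(·, σ1)), with a case distinction following the recursive definition of even; the helper name `multi_clause_formula` and the string-level representation of substitutions are assumptions for illustration.

```python
def multi_clause_formula(side, main, cases, goal_var):
    # side: list of functions producing side literals for an argument string
    # main: function producing the main literal for an argument string
    # cases: list of (conclusion_arg, [hypothesis_arg, ...]) pairs
    def implication(v):
        sides = " ∧ ".join(lit(v) for lit in side)
        return f"({sides} → {main(v)})"

    premises = []
    for concl, hyps in cases:
        step = implication(concl)
        if hyps:
            hyp_text = " ∧ ".join(implication(h) for h in hyps)
            step = f"({hyp_text} → {step})"
        premises.append(step)
    return "(" + " ∧ ".join(premises) + ") → " + implication(goal_var)

formula = multi_clause_formula(
    side=[lambda v: f"even({v})"],
    main=lambda v: f"even(add({v}, σ1))",
    cases=[("zero", []), ("s(zero)", []), ("s(s(x))", ["x"])],
    goal_var="z",
)
print(formula)
```

Each base case and the step case carry the side literal as a guard, so the induction hypothesis in the step case is itself an implication.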

Example 7. Negating and clausifying the assertion on line 18 of Figure 1, we obtain the two unit clauses:

$$\mathtt{even}(\sigma\_1) \tag{31}$$

$$\neg \mathtt{even}(\mathtt{mul}(\sigma_0, \sigma_1)) \tag{32}$$

Inducting on (32) using $\Theta_{\mathtt{mul}(\sigma_0, \sigma_1)}$ and induction term σ<sub>0</sub>, we get the following clauses:

$$\begin{aligned} &\neg\mathtt{even}(\mathtt{mul}(\mathtt{zero}, \sigma_1)) \lor \mathtt{even}(\mathtt{mul}(\sigma_2, \sigma_1)) \\ &\neg\mathtt{even}(\mathtt{mul}(\mathtt{zero}, \sigma_1)) \lor \neg\mathtt{even}(\mathtt{mul}(\mathtt{s}(\sigma_2), \sigma_1)) \end{aligned}$$

By function and predicate definitions of mul and even, the base case reduces to false and we are left with the unit clauses

$$\mathtt{even}(\mathtt{mul}(\sigma_2, \sigma_1)) \tag{33}$$

$$\neg \mathtt{even}(\mathtt{add}(\mathtt{mul}(\sigma_2, \sigma_1), \sigma_1)) \tag{34}$$

The hypothesis literal in (33) and the conclusion literal in (34) cannot be binary resolved with each other to solve the step case but they share the term mul(σ2, σ1). We can use (33) and (34) in (IndMC) as side and main literals, respectively, with induction term mul(σ2, σ1) and the case distinction:

$$\Theta_{\mathtt{even}(\mathtt{mul}(\sigma_2, \sigma_1))} = \big\{ (\{z \mapsto \mathtt{zero}\}, \emptyset),\ (\{z \mapsto \mathtt{s}(\mathtt{zero})\}, \emptyset),\ (\{z \mapsto \mathtt{s}(\mathtt{s}(x))\}, \{\{z \mapsto x\}\}) \big\}$$

We get the following induction formula:

$$\begin{aligned} & \forall x, z. \Big( \Big( \big( \mathtt{even}(\mathtt{zero}) \to \mathtt{even}(\mathtt{add}(\mathtt{zero}, \sigma_1)) \big) \land \\ & \qquad \big( \mathtt{even}(\mathtt{s}(\mathtt{zero})) \to \mathtt{even}(\mathtt{add}(\mathtt{s}(\mathtt{zero}), \sigma_1)) \big) \land \\ & \qquad \big( ( \mathtt{even}(x) \to \mathtt{even}(\mathtt{add}(x, \sigma_1)) ) \to \\ & \qquad\quad ( \mathtt{even}(\mathtt{s}(\mathtt{s}(x))) \to \mathtt{even}(\mathtt{add}(\mathtt{s}(\mathtt{s}(x)), \sigma_1)) ) \big) \Big) \\ & \quad \to \big( \mathtt{even}(z) \to \mathtt{even}(\mathtt{add}(z, \sigma_1)) \big) \Big) \end{aligned} \tag{35}$$

After clausifying (35), and binary resolving the resulting clauses against (33) and (34), using function and predicate definitions and the unit clause (31), we arrive at the empty clause, thus validating the assertion at line 18 in Figure 1.
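As a sanity check of the property established in Example 7, the assertion and the generalized lemma used in (35) can be tested on small natural numbers. This is bounded testing and not a proof; the recursive definitions of add, mul and even below are assumptions modelled on the usual Peano-style definitions (Figure 1 is not reproduced in this section), with Python integers standing in for constructor terms.

```python
def add(x, y):
    # add(zero, y) = y;  add(s(x), y) = s(add(x, y))
    return y if x == 0 else 1 + add(x - 1, y)

def mul(x, y):
    # mul(zero, y) = zero;  mul(s(x), y) = add(mul(x, y), y)
    return 0 if x == 0 else add(mul(x - 1, y), y)

def even(n):
    # even(zero);  not even(s(zero));  even(s(s(z))) iff even(z)
    if n == 0:
        return True
    if n == 1:
        return False
    return even(n - 2)

BOUND = 12  # small bound to keep the recursion shallow

# Assertion of Example 7: even(y) implies even(mul(x, y)).
goal_holds = all(
    even(mul(x, y))
    for x in range(BOUND) for y in range(BOUND) if even(y)
)

# Generalized lemma of (35): with even(y), even(z) implies even(add(z, y)).
lemma_holds = all(
    even(add(z, y))
    for z in range(BOUND) for y in range(BOUND) if even(z) and even(y)
)
```

Both checks succeed within the bound, which is consistent with the refutation described above but does not replace it.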

We conclude this section by noting that the (IndMC) inference rule might use an arbitrary number of side literals, reducing the practical efficiency of saturation-based proving with multi-clause induction. As a remedy, the following two heuristics could be used to choose the literal L from clause L ∨ C as a side literal of (IndMC): (i) if L is p(s) for some predicate p, and L is an induction hypothesis for the main literal p(t), and s and t share some non-Skolem (complex) term with an active occurrence, or (ii) if neither L nor the main literal are derived from a clausified induction formula and they share some common term with an active occurrence.

#### VII. EXPERIMENTS

Implementation. We implemented our approach to automating induction with recursive definitions in the superposition-based theorem prover VAMPIRE. We extended VAMPIRE's induction framework [18] with recursive definitions and hypothesis strengthening, as described in Section IV. This can be enabled with --structural\_induction\_kind rec\_def. Rewriting with induction hypotheses and function definitions, as presented in Section V, can be switched on using --induction\_hypothesis\_rewriting on and --function\_definition\_rewriting on, respectively. The multi-clause induction rule from Section VI is enabled by --induction\_multiclause on. Altogether, our implementation consists of around 5,000 lines of C++ code and is available at https://github.com/vprover/vampire/tree/induction-recursive-functions.

Experimental setup. To experimentally evaluate our approach, we used the benchmarking tool BENCHEXEC [27], [28] and two benchmark sets<sup>2</sup>: (i) the UFDTLIA examples from SMT-LIB [24], consisting of 327 problems over algebraic data types; and (ii) our new set dty RD of 3,397 inductive examples with recursive definitions, as described in [30]. We used the keyword define-fun-rec for defining recursive functions in the examples from our dty RD dataset. Moreover, we also converted examples from the UFDTLIA set to explicitly use define-fun-rec, detecting this way recursive definitions in UFDTLIA.

Fig. 2. Numbers of problems solved by respective solvers in our experiments. The number in parentheses is the number of problems solved uniquely compared to the other solvers.

We also combined our inductive approach in VAMPIRE with recent developments in first-order reasoning [18], [31], [32], creating this way various VAMPIRE configurations for automating induction with recursive definitions. The *default options* we used for these configurations are: --induction\_gen on --induction\_on\_complex\_terms on, enabling inductive generalizations and induction on complex terms [18]; --newcnf on to select the cnf method in [31]; and --theory\_split\_queue on --theory\_split\_queue\_cutoffs 0,8 and --theory\_split\_queue\_ratios 20,10,1 to control theory reasoning with split queues [32]. As a result, we designed a new VAMPIRE portfolio mode for inductive reasoning, which can be switched on by --mode portfolio --schedule struct\_induction.
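For reference, the options named in this section can be collected into a single command line. The sketch below only assembles the argument list and does not run anything; the executable name `vampire` and the problem file name are placeholders, and further options (for example, to select the input syntax or to enable induction in general) may be required and are not shown.

```python
import shlex

# Option/value pairs exactly as named in the text of this section.
OPTIONS = [
    ("--structural_induction_kind", "rec_def"),
    ("--induction_hypothesis_rewriting", "on"),
    ("--function_definition_rewriting", "on"),
    ("--induction_multiclause", "on"),
    ("--induction_gen", "on"),
    ("--induction_on_complex_terms", "on"),
    ("--newcnf", "on"),
    ("--theory_split_queue", "on"),
    ("--theory_split_queue_cutoffs", "0,8"),
    ("--theory_split_queue_ratios", "20,10,1"),
]

def vampire_command(problem, binary="vampire"):
    # binary and problem are placeholders chosen for this illustration.
    argv = [binary]
    for name, value in OPTIONS:
        argv += [name, value]
    argv.append(problem)
    return argv

command = vampire_command("problem.smt2")
print(shlex.join(command))
```

The resulting list can be passed to a process launcher of choice once the binary from the repository branch mentioned above has been built.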

Experimental comparison. In what follows, VAMPIRE refers to the (default) version of VAMPIRE, as in [18]. By VAMPIRE<sup>∗</sup> we denote our new version of VAMPIRE, using induction with recursive definitions and the aforementioned options. We compared our work in VAMPIRE<sup>∗</sup> against VAMPIRE, as well as against the superposition prover ZIPPERPOSITION<sup>3</sup> [16] and the SMT solver CVC4 [33].

Since the default mode of VAMPIRE and VAMPIRE<sup>∗</sup> only occasionally solves unique problems with respect to their portfolio mode counterpart, we omitted the former results. Note that we used the same portfolio schedule struct\_induction for VAMPIRE as well. Since in portfolio mode VAMPIRE ignores the new options and most of the schedule is not specific to VAMPIRE<sup>∗</sup>, the results obtained for VAMPIRE give a meaningful baseline. We used ZIPPERPOSITION in the default mode, while for CVC4 we used the parameters --conjecture-gen --quant-ind. Each prover was given 300 seconds of time and 16 GB of memory per problem. The experiments were run on computers with 32 cores (AMD Epyc 7502, 2.5 GHz) and 1 TB RAM.

Experimental results. We summarize our experimental results in Figure 2. For each solver, listed in the first column of

<sup>2</sup>While some examples from the TIP library [29] are included in SMT-LIB, most of the TIP examples are parametric and not yet supported by VAMPIRE.

<sup>3</sup>ZIPPERPOSITION has a non-official option --input tip to parse benchmarks in a variant of SMT-LIB. In order to parse UFDTLIA benchmarks, we converted them to this variant.


Fig. 3. Numbers of problems solved by VAMPIRE<sup>∗</sup> with different new features disabled. The number in parentheses is the number of problems solved uniquely compared to the other configurations.

Figure 2, we indicate the total number of examples the solver proved from the respective benchmark category; the values in parentheses show the number of uniquely solved problems compared to the other solvers. Figure 2 shows that while VAMPIRE performs reasonably well on both benchmark sets, it cannot solve more problems than CVC4 in the UFDTLIA set and than ZIPPERPOSITION in the dty RD set, where the latter two perform the best. VAMPIRE<sup>∗</sup>, on the other hand, is able to solve many more problems than the other solvers in both sets, suggesting that combining the state-of-the-art techniques of superposition with induction over recursive definitions can perform much better than SMT solvers and superposition provers with only structural induction. Altogether, VAMPIRE<sup>∗</sup> solved 527 new problems that the other automated solvers could not prove. It is also worth noting that while VAMPIRE<sup>∗</sup> dominates the uniquely solved problems w.r.t. the dty RD set, its dominance is only marginal compared to the uniquely solved problems of CVC4 in the UFDTLIA set. Looking at the problems uniquely solved by CVC4, we found that these problems mostly contain either some nested structure that current techniques in VAMPIRE<sup>∗</sup> cannot handle and require non-trivial lemma generation, or recursive definitions that cannot be used with our induction formula generation as their well-foundedness is not based on the subterm relation.

In addition to comparing to other solvers, we compared VAMPIRE<sup>∗</sup> to itself with different techniques from the paper disabled, overriding the portfolio options during these runs. Our results are shown in Figure 3.

For UFDTLIA, the default run still performs best, but we can see different deviations from this value with each disabled technique. We argue that the relatively small differences obtained by turning off induction hypothesis rewriting (-indhrw off) and function definition rewriting (-fnrw off) can be attributed to combinations of options that together may simulate these techniques. In comparison, multiclause induction cannot be simulated with other techniques in VAMPIRE, so the relatively small difference obtained by turning off this technique (-indmc off) for UFDTLIA is probably due to the lack of non-unit induction needed in most of this set. For dty RD, the decrease in solved problems when this feature is turned on needs further investigation. The greatest difference to the default is obtained by using structural induction (-sik one, see [17]) instead of inferring induction formulas from recursive function definitions. We can conclude with the observation that each configuration solved problems uniquely, which suggests that the portfolio schedule can be improved.

#### VIII. RELATED WORK

Generation of induction formulas, as presented in Section IV, although similar to *recursion analysis* of [7] and *recursion induction* of [10], utilizes unification and generates non-trivial induction hypotheses. Our work complements these techniques by integrating induction in saturation: rather than replacing inductive goals by sub-goals/other formulas, we generate induction formulas over recursive definitions and add these induction formulas as additional properties to the search space.

When compared to superposition approaches treating certain E-theories [19] or function definitions as rewrite rules [16], we note that our method designs new induction inference rules as simplification rules in superposition and strengthens induction hypotheses during saturation-based inductive reasoning. Our approach extends [17] by handling recursive definitions as rewrite rules and multiple clauses in a single induction step; the latter is often required when assumptions are supported in universally quantified conjectures. Unlike [16], our technique generalizes to scenarios where multiple induction steps are needed to refute non-equality literals. In contrast to [12], we are not limited to induction over term algebras, as most of these techniques also work for, e.g., mathematical induction.

While our approach often does not need auxiliary lemmas due to generalizations over (complex) term occurrences and strengthened induction hypotheses, extending our work towards lemma generation would be beneficial. In particular, theory exploration and lemma generation approaches from [8], [15], [10], [34], [35], [13] could complement our method, ranging from randomly generating terms by iterative deepening to analysing failed induction steps and even circumventing the need for auxiliary lemmas by using predicates.

#### IX. CONCLUSION

We introduce a new approach for automating induction with recursive definitions in first-order theorem proving. We design new inference rules for rewriting with function definitions as well as induction hypotheses in superposition-based proving. We generate induction formulas based on recursive function definitions and extend our work to support multiclause induction. Our experiments show that induction with recursive definitions in superposition allows us to solve many new problems that other automated reasoners failed to prove.

#### ACKNOWLEDGMENTS

This work was partially funded by the ERC CoG ARTIST 101002685, the ERC StG SYMCAR 639270, the EPSRC grant EP/P03408X/1, the FWF grant LogiCS W1255-N23, the Amazon ARA 2020 award FOREST and the TU Wien SecInt DK.

#### REFERENCES


# Fair and Adventurous Enumeration of Quantifier Instantiations

Mikoláš Janota *Czech Technical University in Prague* Prague, Czech Republic Haniel Barbosa *Universidade Federal de Minas Gerais* Belo Horizonte, Brazil Pascal Fontaine *University of Liège* Liège, Belgium Andrew Reynolds *University of Iowa* USA

*Abstract*—SMT solvers generally tackle quantifiers by instantiating their variables with tuples of terms from the ground part of the formula. Recent enumerative approaches for quantifier instantiation consider tuples of terms in some heuristic order. This paper studies different strategies to order such tuples and their impact on performance. We decouple the ordering problem into two parts. First is the order of the sequence of terms to consider for each quantified variable, and second is the order of the instantiation tuples themselves. While the most and least preferred tuples, i.e. those with all variables assigned to the most or least preferred terms, are clear, the combinations in between allow flexibility in an implementation. We look at principled strategies of complete enumeration, where some strategies are more fair, meaning they treat all the variables the same but some strategies may be more adventurous, meaning that they may venture further down the preference list. We further describe new techniques for discarding irrelevant instantiations which are crucial for the performance of these strategies in practice. These strategies are implemented in the SMT solver cvc5, where they contribute to the diversification of the solver's configuration space, as shown by our experimental results.

*Index Terms*—SMT, quantifier instantiation, enumeration

#### I. INTRODUCTION

While SMT (satisfiability modulo theory) solvers [5] are used successfully as decision procedures to automatically discharge quantifier-free proof obligations for many applications, there is an increasing need for tools that can furthermore handle quantifiers. Quantified languages however are most often undecidable, or have prohibitive complexity. Quantifier handling within SMT solving is thus a challenge and requires good heuristics.

Quantifier reasoning in SMT builds on the strength of SMT solvers, that is, their ability to efficiently reason on ground formulas, and relies on instantiation: ground consequences of quantified formulas are generated, and the ground reasoner's view of the problem is gradually refined with these instances, to embed knowledge from the quantified formula into ground reasoning. The terms to generate instances may be generated using mostly syntactic methods, e.g., E-matching [6], or semantic techniques like model-based quantifier instantiation [7]. But plain enumeration, done in a principled manner, can give surprisingly good results, particularly in combination with other instantiation techniques [8].

A crucial aspect, when using enumeration-based instantiation, is to prioritize the numerous, often infinite, potential instantiations. When instantiating just one variable, this is essentially a matter of prioritizing smaller terms that are already present in the original formula, according to some order. Quantified assertions however most often have many quantified variables, and there is a lot of freedom on the order on tuples of terms to instantiate those. We here investigate a few strategies based on different tuple orders, some favoring fairness, some being more adventurous, and show that they are valuable in a portfolio of enumerative instantiation strategies. In Section IV, we also present an elimination technique for redundant instantiations that significantly contributes to the improvement of enumeration-based instantiation.

#### II. BACKGROUND

Originally, SMT solvers were essentially decision procedures for ground (i.e., quantifier-free) problems in a combination of decidable languages, containing e.g., operators to handle arrays, linear arithmetic expressions, bitvectors, and uninterpreted predicates and functions. They excel at deciding the satisfiability of large formulas in these languages. As a toy example, consider the (satisfiable) conjunctive set of formulas

$$\{R(a), \neg S(b), a = b\}.$$

It belongs to the quantifier-free fragment of first-order logic, and as such, is decided by many SMT solvers. Quantifier reasoning in modern SMT solvers builds on this. The input formula, possibly after a pre-processing phase, is first given to the ground solver. From the point of view of this ground solver, each quantified formula is abstracted into a distinct propositional variable. As an example, the conjunctive set

$$\{R(a), \neg S(b), a = b, \forall x \,. R(x) \Rightarrow S(x)\}$$

is understood by the ground solver as the previous ground set, augmented with an abstract proposition Q corresponding to ∀x . R(x) ⇒ S(x). Then the ground solver provides a satisfying assignment for the ground part of the formula, including a valuation of the propositional variables abstracting the quantified formulas (in our case Q must be true). The instantiation module recovers the quantified formulas associated to these variables, and generates new instances of the quantified formulas to the ground reasoner (Figure 1). In our toy example such an instance could be

$$Q \Rightarrow \left(R(a) \Rightarrow S(a)\right),$$

https://doi.org/10.34727/2021/isbn.978-3-85448-046-4 This article is licensed under a Creative Commons Attribution 4.0 International License

Fig. 1. The SMT instantiation loop.

which would render the problem unsatisfiable at the ground level. In general, the instantiation loop is iterated until the ground reasoner is able to conclude that the formula is unsatisfiable, a timeout is reached, or no instance can be deduced anymore. In this paper, we focus on refutations only and will not consider the last case.
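The ground-level refutation of the toy example can be spelled out step by step. The derivation below is an illustrative sketch that uses only the formulas of the example together with standard propositional and equality reasoning:

$$\begin{aligned} Q,\; Q \Rightarrow (R(a) \Rightarrow S(a)) \;&\vdash\; R(a) \Rightarrow S(a) \\ R(a),\; R(a) \Rightarrow S(a) \;&\vdash\; S(a) \\ S(a),\; a = b \;&\vdash\; S(b) \\ S(b),\; \neg S(b) \;&\vdash\; \bot \end{aligned}$$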

Thanks to the Herbrand Theorem (see e.g., [8]), with fair enumeration of instances using all possible terms built on the appropriate set of symbols, SMT solving is refutationally complete for satisfiability modulo well-behaved first-order theories. Since typical SMT inputs contain hundreds of quantified formulas with many nested quantifiers, on a language with often infinitely many terms, the number of possible instances is very large, and most often infinite. It is crucial to quickly find the right instances, otherwise the ground solver will be overwhelmed by the number of instances. For a quantified formula $\forall x_1 \ldots x_n .\, \varphi$ with n variables, this boils down to ordering n-tuples of ground terms to prioritize instantiation.

#### III. ENUMERATION STRATEGIES

We start with the assumption that for each variable $x_i$ there is a sequence of terms $T_i = t_i^1, t_i^2, \ldots$, which are the possible candidates for instantiation into the variable $x_i$. We further assume that this sequence of terms is sorted by some given preference, i.e., that $t_i^j$ is more likely to yield a useful instantiation than the candidate $t_i^{j'}$ with $j < j'$. This lets us focus on the indices into the sequences of terms, rather than on the terms themselves. An instantiation, i.e., a tuple of terms, is uniquely represented as an n-tuple of indices.

While this setup already assumes a given order on the terms for the individual variables, it does not tell us how to order the actual tuples. Clearly, the tuple of indices $(0, \ldots, 0)$ is the most advantageous and $(|T_1| - 1, \ldots, |T_n| - 1)$ is the least advantageous one. However, it is unclear whether (0, 1, 1) is more advantageous than (0, 0, 2), or the other way around. This motivates our quest for different enumeration strategies. A general notion from multi-objective optimization is useful: *Pareto-optimal* solutions are such that improving any criterion worsens some other.

Fig. 2. Pareto graph for 3 variables with 4 candidate terms for each.

Definition 1 (Pareto dominates). *Let* $t_1 = (a_1, \ldots, a_n)$ *and* $t_2 = (b_1, \ldots, b_n)$ *be* n*-tuples of integers. We say that* $t_1$ Pareto dominates $t_2$*, if and only if* $t_1 \neq t_2$ *and* $a_i \leq b_i$ *for all* $i \in 1..n$*.*

We focus on traversals of the graph of tuples where traversing an edge increases one of the indices. Hence, there is an edge from tuple $t_1$ to tuple $t_2$ iff $t_2$ is obtained by increasing one of the digits of $t_1$ by 1; see Figure 2. This graph anchors our initial motivation that the order on the terms pertaining to a single variable represents preference. Indeed, following down any edge in this graph means going to a less preferred tuple. We call this graph the *Pareto graph*.

So what differentiates one traversal from another? In graph theory vernacular, a traversal is broad or deep. In our context, a broad traversal is more *fair* since it alters terms of different variables evenly. A deep traversal is more *adventurous* since it opts for less preferred, i.e., riskier, instantiations.

Fair strategies observe the Pareto ordering, meaning that no tuple dominates any of the previous tuples. For instance, the sequence (0, 0),(0, 1),(1, 0),(1, 1) respects Pareto ordering but (0, 0),(0, 1),(1, 1),(1, 0) does not because (1, 0) Pareto-dominates (1, 1). Note that both of these examples respect the Pareto graph in the sense that a node is visited only if at least one of its predecessors has been visited.
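Definition 1 and the notion of observing the Pareto ordering can be checked mechanically. The following Python sketch is illustrative only; the function names `pareto_dominates` and `respects_pareto_order` are hypothetical, and tuples are assumed to have equal length.

```python
def pareto_dominates(t1, t2):
    """True iff t1 != t2 and t1 is componentwise <= t2 (Definition 1)."""
    return t1 != t2 and all(a <= b for a, b in zip(t1, t2))


def respects_pareto_order(sequence):
    """True iff no tuple dominates a tuple that appears earlier in the sequence."""
    return not any(
        pareto_dominates(later, earlier)
        for i, later in enumerate(sequence)
        for earlier in sequence[:i]
    )


fair = [(0, 0), (0, 1), (1, 0), (1, 1)]
unfair = [(0, 0), (0, 1), (1, 1), (1, 0)]
print(respects_pareto_order(fair))    # the first sequence from the text
print(respects_pareto_order(unfair))  # the second sequence from the text
```

The quadratic check is meant for small examples, not for use inside a solver.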

In the remainder of the section we introduce techniques considered in the experimental evaluation in Section V. On a technical note, in practice the number of possible candidates per variable may vary, but for the sake of clarity, we assume that each variable has the same number of possible candidate terms. This means that every element of the tuple (digit) is in the range 0..M for some fixed M ∈ N. Effectively, this means that we are looking for systematic enumerations of tuples from the space $[0..M]^n$, with a fixed set of n variables.

#### *A. Stages by maximal digit [8]*

This ordering interprets tuples as numbers in increasing base b ∈ 2..(M + 1). As an example, consider two variables and M = 2. The enumeration starts with base 2, yielding: (0, 0),(1, 0),(0, 1),(1, 1). Subsequently, it switches to base 3, while skipping already enumerated tuples, giving the rest of the tuples: (2, 0),(2, 1),(0, 2),(1, 2),(2, 2).

This is a natural alternative to interpreting the tuples as numbers in base M + 1, which would lead to a highly unfair strategy because large values of M would lead to changing significant digits very late.

This ordering observes Pareto domination and the enumeration algorithm runs in constant space.
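The staging by maximal digit can be sketched in a few lines of Python. This is a minimal illustration, not the constant-space algorithm mentioned above: it filters a full product for each base and assumes, in line with the example, that the first position is the least significant digit. The name `max_digit_order` is hypothetical.

```python
from itertools import product


def max_digit_order(n, M):
    """Yield all tuples of [0..M]^n, staged by their maximal digit."""
    for base in range(1, M + 2):
        # product() varies the last position fastest; reversing each tuple
        # makes the first position the least significant digit.
        for reversed_digits in product(range(base), repeat=n):
            t = reversed_digits[::-1]
            # Tuples with a smaller maximum were produced in an earlier stage.
            if max(t) == base - 1:
                yield t


print(list(max_digit_order(2, 2)))
```

For two variables and M = 2 the output reproduces the sequence given in the example, first the base-2 tuples and then the remaining base-3 tuples.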

#### *B. Stages by sum of digits*

The maximum digit approach mitigates unfairness for large values of M (large number of candidate terms). However, it still leads to an imbalance with a large number of quantified variables, i.e., with large tuples. Indeed, even with M = 1 already 10 variables require $2^{10}$ iterations before the most significant digit is changed. The alternative is to iterate over combinations stratified by the *sum of all the digits*. Tuples with the same sum of digits are ordered lexicographically.

This leads to a breadth-first traversal of the Pareto graph and its effect is more pronounced with a large number of variables. The initial sequence has the following form:

$$(0,0,\ldots,0), (1,0,\ldots,0), (0,1,\ldots,0), \ldots, (0,0,\ldots,1), \\ (2,0,\ldots,0), (1,1,\ldots,0), (0,2,\ldots,0), \ldots$$

This ordering also observes the Pareto domination and can be calculated in constant space.
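A simple way to experiment with this ordering is sketched below in Python. It is not the constant-space algorithm; it filters a full product for every sum. Ties within one sum follow the displayed sequence, with the first position treated as the least significant digit; this tie-breaking direction and the name `sum_of_digits_order` are assumptions of the sketch.

```python
from itertools import product


def sum_of_digits_order(n, M):
    """Yield all tuples of [0..M]^n, staged by the sum of their digits."""
    for total in range(n * M + 1):
        for reversed_digits in product(range(M + 1), repeat=n):
            # Reverse so that the first position is the least significant digit.
            t = reversed_digits[::-1]
            if sum(t) == total:
                yield t


print(list(sum_of_digits_order(3, 2))[:7])
```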

#### *C. Leximax*

Arguably the most fair strategy is enumeration according to the *leximax order* [1] since all the variables are in equivalent roles: let $t_1, t_2$ be n-tuples of integers. We say that $t_1$ is *leximax preferred* to $t_2$ if $t_1^{\downarrow}$ is lexicographically smaller than $t_2^{\downarrow}$, where $t^{\downarrow}$ denotes $t$ sorted in descending order. Enumeration can be done in constant space. We observe that all permutations of a tuple are incomparable. This enables us to stage the enumeration by gradually worsening a sorted tuple and enumerate all its permutations through standard means. The incomparable permutations are enumerated lexicographically. For two variables the sequence starts as follows: (0, 0),(0, 1),(1, 0),(1, 1),(0, 2),(2, 0). Contrast that with the sum of digits: (0, 0),(0, 1),(1, 0),(0, 2),(1, 1),(2, 0).
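The leximax order can be expressed directly as a sort key. The Python sketch below materializes and sorts the whole space, so it is not the constant-space enumeration described above; it only illustrates the order itself. The name `leximax_order` is hypothetical.

```python
from itertools import product


def leximax_order(n, M):
    """Return all tuples of [0..M]^n sorted by the leximax order."""
    def key(t):
        # Primary key: the tuple sorted in descending order.
        # Secondary key: the tuple itself, so that the incomparable
        # permutations are enumerated lexicographically.
        return (tuple(sorted(t, reverse=True)), t)

    return sorted(product(range(M + 1), repeat=n), key=key)


print(leximax_order(2, 2)[:6])
```

For two variables the first six tuples agree with the sequence given in the text.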

#### *D. Iterative Deepening and Random-walk Search*

The strategies discussed so far never violate Pareto domination. A depth-first traversal would violate it, but would also have a large degree of unfairness. Instead, we propose to use *iterative deepening* where the maximum depth is incremented by some fixed parameter $k \in \mathbb{N}^+$. Maximum depth 2 yields (0, 0),(0, 1),(0, 2),(1, 1),(1, 0),(2, 0), where (1, 0) Pareto-dominates (1, 1), even though it comes later in the sequence.
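Iterative deepening over the Pareto graph might be sketched as follows in Python. Several details are assumptions made to reproduce the sequence in the text: the depth of a tuple is taken to be the sum of its digits, children are generated by incrementing the last position first, and each round restarts from the root and emits only tuples not emitted before. The function names are hypothetical.

```python
def _depth_limited_dfs(t, limit, M, visited):
    """Depth-first traversal of the Pareto graph, cut off at depth `limit`."""
    visited.add(t)
    yield t
    if sum(t) >= limit:
        return
    # Assumption: increment the last position first.
    for i in reversed(range(len(t))):
        if t[i] < M:
            child = t[:i] + (t[i] + 1,) + t[i + 1:]
            if child not in visited:
                yield from _depth_limited_dfs(child, limit, M, visited)


def iterative_deepening_order(n, M, k=1):
    """Yield all tuples of [0..M]^n, raising the depth limit by k per round."""
    emitted = set()
    limit = k
    while True:
        for t in _depth_limited_dfs((0,) * n, limit, M, set()):
            if t not in emitted:
                emitted.add(t)
                yield t
        if limit >= n * M:
            return
        limit += k


print(list(iterative_deepening_order(2, 2, k=2))[:6])
```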

As another very adventurous strategy, we propose *random-walk traversal*, which is similar to DFS but instead of a stack we use a set where the next element is chosen randomly.
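A random-walk traversal of the Pareto graph can be sketched as below. This is an illustration under assumptions: the frontier is kept in a list so that a random element can be removed, and the seed parameter and the name `random_walk_order` are hypothetical.

```python
import random


def random_walk_order(n, M, seed=0):
    """Yield all tuples of [0..M]^n, picking the next frontier node at random."""
    rng = random.Random(seed)
    root = (0,) * n
    frontier = [root]
    seen = {root}
    while frontier:
        # Unlike DFS, which always pops the most recent node, pick any node.
        t = frontier.pop(rng.randrange(len(frontier)))
        yield t
        for i in range(n):
            if t[i] < M:
                child = t[:i] + (t[i] + 1,) + t[i + 1:]
                if child not in seen:
                    seen.add(child)
                    frontier.append(child)


print(list(random_walk_order(2, 2, seed=1)))
```

Every tuple is emitted exactly once and only after one of its predecessors in the Pareto graph, but the order may violate Pareto domination.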

#### IV. DISCARDING REDUNDANT INSTANTIATIONS

When solving quantified formulas, SMT solvers are often hindered by an overabundance of generated instantiations. Thus, it is paramount to avoid instantiations that are *redundant*. At a high level, an instantiation is considered redundant if it does not help rule out models in the current context. Methods for discovering redundant instantiations are particularly important in the context of enumerative instantiation, where typically we are iterating over similar domains of terms on multiple instantiation rounds, and are looking for the first instantiation that is not redundant.

In our implementation, we consider three criteria for determining that an instantiation $\varphi \cdot \{x_1 \mapsto t_1, \ldots, x_n \mapsto t_n\}$ is redundant, in increasing order of cost:


when an instantiation lemma is already implied by the current set of constraints known by the SMT solver. All instantiations that are entailed are considered redundant.

3) (Duplicate Formula Modulo Rewriting) Maintain a set of previous formulas returned by quantifier instantiation. Construct the formula $\varphi \cdot \{x_1 \mapsto t_1, \ldots, x_n \mapsto t_n\}$ and normalize it using rewriting techniques. If the resulting formula is already in our set, it is redundant.

If none of these criteria hold, the instantiation is not considered redundant.

It is important to note that the latter two methods allow one to learn that a *class* of instantiations is redundant. For this purpose, we introduce the concept of a *fail mask* for an instantiation. A fail mask $M$ for a substitution $\{x_1 \mapsto t_1, \ldots, x_n \mapsto t_n\}$ is a sequence of n bits such that all substitutions that extend $\{x_i \mapsto t_i \mid \text{the } i\text{th bit of } M \text{ is set}\}$ when applied to $\varphi$ result in a redundant instantiation.

For example, let $\varphi$ be the formula $P(x_1, x_2) \vee Q(x_2, x_3)$, and consider the substitution $\sigma = \{x_1 \mapsto a, x_2 \mapsto b, x_3 \mapsto c\}$. Let $E = \{P(a, b), \neg Q(b, c)\}$ be the current set of assertions from the ground solver. The instantiation $\varphi \cdot \sigma$ is redundant; a fail mask for $\sigma$ is 110, since $P(a, b) \vee Q(b, x_3)$ is entailed by $E$ for any value of $x_3$.

We incorporate fail masks into our implementation in the following way. When an instantiation $\varphi \cdot \sigma$ is discovered to be redundant, we construct the fail mask $M$ containing all 1s. Starting with $i = 1$, we drop the entry $\{x_i \mapsto t_i\}$ from $\sigma$. If the instantiation is still redundant based on the latter two criteria above, then we set the $i$th bit to 0. If not, then we re-add the entry $\{x_i \mapsto t_i\}$ to $\sigma$, and proceed with $i + 1$. Notice this means that our computation of the fail mask is greedy.
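The greedy computation of a fail mask can be sketched as follows. The solver's redundancy checks are replaced by a callback `is_redundant` that takes a partial substitution; this callback, the function names, and the toy oracle are hypothetical stand-ins. The toy oracle hard-codes the example above, where the instantiation stays redundant exactly when $x_1 \mapsto a$ and $x_2 \mapsto b$ are both kept.

```python
def greedy_fail_mask(substitution, is_redundant):
    """Greedily compute a fail mask (list of bits) for a redundant substitution."""
    remaining = dict(substitution)
    mask = []
    for variable, term in substitution.items():
        # Tentatively drop the entry for this variable.
        del remaining[variable]
        if is_redundant(remaining):
            mask.append(0)  # still redundant: the entry is not needed
        else:
            remaining[variable] = term  # re-add the entry and keep the bit set
            mask.append(1)
    return mask


def toy_oracle(partial):
    """Stand-in for the solver's check on the example with P and Q."""
    return partial.get("x1") == "a" and partial.get("x2") == "b"


sigma = {"x1": "a", "x2": "b", "x3": "c"}
print(greedy_fail_mask(sigma, toy_oracle))
```

Because entries are dropped one at a time and never reconsidered, the result depends on the variable order, which is the greediness noted in the text.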

The fail mask is incorporated into the enumerative strategies as follows. After each failed instantiation, combine the tuple of term indices and the fail mask into a tuple with wildcards, denoted "?". So for instance, if the tuple (5, 4, 3) fails with the mask 101, construct the tuple (5, ?, 3) meaning that if the first variable is instantiated with the 5th term and the third variable with the 3rd term, the instantiation is bound to be redundant. Such combinations we wish to avoid. This is checked independently of the enumeration algorithm by storing the disabled patterns into a trie and discarding any combinations matching one of the previously disabled patterns. The trie handles the wildcard character ? specially by always matching on it.
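One possible realization of the pattern store is a trie over nested dictionaries, sketched below in Python. The class `PatternTrie` and the helper `to_pattern` are hypothetical names, and the sketch assumes that all patterns and tuples have the same length.

```python
WILDCARD = "?"


def to_pattern(indices, mask):
    """Combine a tuple of term indices and a fail mask into a wildcard pattern."""
    return tuple(i if bit else WILDCARD for i, bit in zip(indices, mask))


class PatternTrie:
    """Stores disabled patterns; a wildcard matches any index at its position."""

    def __init__(self):
        self.root = {}

    def add(self, pattern):
        node = self.root
        for symbol in pattern:
            node = node.setdefault(symbol, {})

    def matches(self, indices):
        nodes = [self.root]
        for index in indices:
            next_nodes = []
            for node in nodes:
                # Follow both the exact index and the wildcard branch.
                for key in (index, WILDCARD):
                    if key in node:
                        next_nodes.append(node[key])
            if not next_nodes:
                return False
            nodes = next_nodes
        return True


trie = PatternTrie()
trie.add(to_pattern((5, 4, 3), (1, 0, 1)))  # stores the pattern (5, ?, 3)
print(trie.matches((5, 0, 3)), trie.matches((5, 4, 2)))
```

An enumeration strategy would call `matches` on each candidate tuple and skip it when the result is true.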

#### V. EXPERIMENTS

This section reports on our experimental evaluation of different tuple enumeration strategies implemented in the cvc5 SMT solver (the successor of CVC4 [3]). We performed all experiments on a cluster with Intel Xeon E5-2620 CPUs at 2.1 GHz and 128 GB of memory, providing one core, 300 seconds, and 8 GB of RAM for each job.

Enumerative instantiation is extensively compared with other techniques in [8], where it was concluded that interleaving E-matching with enumeration gives the best results. However, as the focus of the paper is the different enumeration

TABLE I SUMMARY OF PROBLEMS SOLVED. BEST NON-PORTFOLIO RESULTS ARE IN BOLD.


strategies, we run enumeration on its own. For succinctness, we omit certain details, such as the relevant domain heuristic, which is run as proposed in [8].

Benchmarks are selected from first-order benchmarks from the TPTP library [10], version 7.4.0, and from SMT-LIB [4], 2020 release. Of 19287 first-order TPTP problems, we excluded 660 which contained polymorphic types, leaving 18627 for consideration. For SMT-LIB, we considered all problems from logics containing quantifiers and integer arithmetic, i.e., UF, UFLIA, and UFNIA, totaling 31314 problems. This selection of benchmarks was inspired by the evaluation from [8], where enumerative instantiation was shown more effective in the above sets.

Fig. 3. Impact of elimination of redundant instantiation via fail masks.

The evaluation covers a number of cvc5 configurations. The default enumeration, maximal digit, is denoted as u. Its variations according to different enumeration strategies described above are id-n for iterative deepening with increment n; lmax for leximax; sum of digits; and rwlk for random walk. We also run, for control, cvc5's E-matching (denoted e) and z3 4.8.10 (denoted z3). By default z3 uses a combination of E-matching and model-based quantifier instantiation. All the cvc5 configurations run with the fail-masks technique enabled; further, they use conflict-based instantiation [2], [9] as a "fail-fast" technique, given its strong focusing effect. The implementation of E-matching in cvc5 already uses a redundancy checking mechanism [2], which is always enabled in our experiments. The z3 evaluation is restricted to SMT-LIB, given its limited support for TPTP.

The results are summarized in Table I. The column allu-port is a virtual best solver (vbs) of all the enumerative configurations, eu-port of a vbs of only e and u, and eallu-port a vbs of all cvc5 configurations. We first emphasize the tremendous advantage in UFNIA of u over e, which can be explained by many benchmarks needing instantiations with key arithmetic constants, such as 0, to enable the necessary ground reasoning to solve the problem. However, a large number of these benchmarks may be impossible to solve via E-matching alone: if matching needs to be done on terms containing arithmetic operators, e.g. to match x + 1 with 1, E-matching will fail, whereas enumerative instantiation would instantiate the formula regardless. Moreover, the different enumeration strategies do lead to significant orthogonality among the different configurations. The number of uniquely solved problems per strategy is shown in Table II. Note also that the vbs of the enumerative configurations versus u reduces the number of *unsolved* problems in UFNIA by almost 3%, while eallu-port vs eu-port reduces the number of unsolved problems by almost 2%. These improvements are also present in TPTP, with similar reductions in the number of unsolved problems when considering all the enumeration strategies in a virtual best solver. This clearly shows the benefit of integrating into actual portfolios different enumeration strategies rather than having just the default one.

TABLE II SUMMARY PROBLEMS SOLVED UNIQUELY PER STRATEGY.

We also evaluated an even more adventurous enumeration strategy than those in Table I, which randomly changes the strategy at each instantiation round, thus effectively simultaneously trying all the strategies. This random strategy performs similarly to the others but can be deeply influenced by the random seed chosen for selecting a strategy each round, to the extent that changing the seed from 0 to 7 makes it go, in UFLIA, from 6007 successes to 6047. This further reinforces the usefulness of diversifying the set of strategies used for quantifier instantiation in practice.

Discarding classes of redundant instantiations using fail masks gives a clear advantage as illustrated in Figure 3 (default enumerative instantiation strategy, on all benchmarks). Using the fail masks leads to 217 uniquely solved problems, whereas without it only 31 problems are solved uniquely. Moreover, a large number of commonly solved problems have very significant speed-ups, as the plot makes clear. These improvements can be explained by the technique being the most effective in problems containing quantifiers with many variables, which are common occurrences among the benchmark sets we considered. On problems where the fail masks do not help, the overhead of computing and checking them is noticeable (see the often prevalent crosses just below the diagonal line). However, it is far from a deterrent, given the significant gains.

#### VI. CONCLUSIONS

Enumerative instantiation is powerful, versatile, and offers a lot of freedom for strategies. We presented several ordering heuristics for instantiation that contribute to the orthogonality of the strategies, and ultimately improve the SMT solver's performance and robustness. This is especially useful when a user is willing to employ a barrage of solver configurations to tackle a high-priority problem instance.

In future work, we plan to investigate the applications of enumerative instantiation strategies for portfolio approaches to SMT solving. We also would like to pursue more advanced techniques where tuple and term orderings are not fixed and may be influenced by previous successes or failures.

#### ACKNOWLEDGMENTS

We thank Mathias Preiner for helping with scripts for computing the experimental results. The results were supported by the Ministry of Education, Youth and Sports within the dedicated program ERC CZ under the project POSTMAN no. LL1902. This scientific article is part of the RICAIP project that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 857306.

#### REFERENCES


# Mathematical Programming Modulo Strings

Ankit Kumar and Panagiotis Manolios Northeastern University

Email: {ankitk, pete}@ccs.neu.edu

*Abstract*—We introduce **TranSeq**, a non-deterministic, branching transition system for deciding the satisfiability of conjunctions of string equations. **TranSeq** is an extension of the Mathematical Programming Modulo Theories (MPMT) constraint solving framework and is designed to enable useful and computationally efficient inferences that reduce the search space, that encode certain string constraints and theory lemmas as integer linear constraints and that otherwise split problems into simpler cases, via branching. We have implemented a prototype, **SeqSolve**, in ACL2s, which uses Z3 as a back-end solver. String solvers have numerous applications, including in security, software engineering, programming languages and verification. We evaluated **SeqSolve** by comparing it with existing tools on a set of benchmark problems and our experimental results show that **SeqSolve** is both practical and efficient.

### I. INTRODUCTION

The problem of solving string equations has interested mathematicians and computer scientists for decades. Security, software engineering and verification applications, in particular, have generated a renewed interest in string solvers. Security applications include finding cross-site scripting vulnerabilities in Web applications, SQL injection attacks and fuzzing [1], [2], [3], [4], [5]. Software engineering applications include testcase generation, symbolic evaluation and flow analysis [6], [7], [8]. Programming language applications include type inference for array processing languages [9], [10].

The basic problem is easy to define. Let $\Gamma$ be a non-empty set of constants. The elements of $\Gamma^*$ form a free monoid, *i.e.*, a structure with a single associative operation, corresponding to concatenation, and an identity element $\epsilon$. Elements of $\Gamma^*$ are called strings or words. Let $X$ be a set of variables over $\Gamma^*$ and let $Y$ be a set of variables over $\Gamma$ such that $\Gamma$, $X$ and $Y$ are disjoint. Elements in $Y$ are also called *unit variables*. Let $Z = X \cup Y$. Elements of the free monoid $(\Gamma \cup Z)^*$ are called sequences, again with $\epsilon$ as the identity. A *normal substitution* is a partial function $\rho : Z \rightharpoonup (\Gamma \cup Z)^*$. Every substitution can be extended to the domain $(\Gamma \cup Z)$, by defining $\rho(a) = a$ for all $a$ not in the domain of $\rho$. We can also extend the domain to $(\Gamma \cup Z)^*$ in the standard way. $w\rho$ stands for the application of substitution $\rho$ to the sequence $w$ and it extends naturally to sequence equations. A solution of a set of equations $\{u_1 = v_1, u_2 = v_2, \ldots, u_n = v_n\}$ is a substitution $\rho$ that when applied to each equation yields identical sequences, *i.e.*, $\{u_1\rho = v_1\rho, u_2\rho = v_2\rho, \ldots, u_n\rho = v_n\rho\}$ is a set of syntactic equivalences over $(\Gamma \cup Z)^*$. The problem statement is: given a set of sequence equations $\{u_1 = v_1, u_2 = v_2, \ldots, u_n = v_n\}$, find a solution if there exists one, otherwise return unsat.
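The definitions of substitution application and solution can be made concrete with a short Python sketch. Sequences are represented as lists of symbols and a substitution as a dictionary from variables to sequences. The function names and the example equation $xa = ax$ are hypothetical, and the sketch does not enforce that unit variables map to single symbols.

```python
def apply_substitution(rho, sequence):
    """Apply rho symbol by symbol; symbols outside rho's domain map to themselves."""
    result = []
    for symbol in sequence:
        result.extend(rho.get(symbol, [symbol]))
    return result


def is_solution(rho, equations):
    """True iff both sides of every equation are syntactically identical under rho."""
    return all(
        apply_substitution(rho, lhs) == apply_substitution(rho, rhs)
        for lhs, rhs in equations
    )


# Hypothetical example: the single equation x a = a x over the constant a.
equations = [(["x", "a"], ["a", "x"])]
print(is_solution({"x": ["a", "a"]}, equations))  # both sides become a a a
print(is_solution({"x": ["b"]}, equations))       # b a versus a b
```

This only checks a given candidate; finding a solution or establishing unsat is the actual problem addressed by the solver.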

Related Work. Makanin, in 1977, proved that the satisfiability of string equations is decidable [11]. A series of results on complexity followed, after which Plandowski showed that the problem is in polynomial space [12]. String solvers supporting a variety of theories are available, *e.g.*, Z3Str3 [13], CVC4 [14], [15], S3P [16], Norn [17], TRAU [18], StrSolve [19], Sloth [2], Kepler<sub>22</sub> [20] and HAMPI [1]. Z3Str3 and CVC4 are multi-theory SMT solvers which consider unbounded string equations with concatenation, substring, replace and length functionality. Together with S3P and Norn, these tools handle a variety of string constraints including string equations, length constraints and regular language membership. However, these tools are incomplete. HAMPI works only for problems with one string variable of fixed size. Kepler<sub>22</sub> is a decision procedure for the straight line and quadratic fragments of string equations. Norn and TRAU can decide only the acyclic fragment whereas Sloth decides straight line and acyclic fragments. To the best of our knowledge, there is no solver that for decidable fragments is both theoretically and practically complete, *e.g.*, none of the above solvers are able to solve the string equation xcyczvycya = yacwazvbux. Therefore it is important to explore new techniques for solving string equations. One of the most promising existing approaches uses context-dependent techniques to improve reasoning about string constraints in the context of DPLL(T)-based SMT solvers [15]. Similarly, our work introduces new techniques for reasoning in the context of BC(T)-based (Branch and Cut Modulo T) MPMT solvers [21], [22].

Contributions. Our contributions include (1) TranSeq, a new non-deterministic, branching transition system that can be used as part of the MPMT framework for combining decision procedures, (2) the SeqSolve solver, an implementation of TranSeq which resolves non-deterministic choices in a way designed to infer as much as possible with as few computational resources as possible, (3) proof sketches of soundness, completeness and termination for TranSeq and (4) an evaluation of SeqSolve using a set of benchmarks from related work, as well as Remora examples [9], [10]. We use publicly available benchmarks, being careful to evaluate only the string solving capabilities of our tool, not irrelevant aspects of the underlying SMT/MPMT tools. The integration of our solver into SMT/MPMT tools is briefly discussed. There are over 1,100 problems in our benchmark and no existing string solver can solve all of them. Experimental results show that SeqSolve is more efficient and complete than existing solvers.

Paper Outline. Section II illustrates some techniques we use to reason about string equations through motivating examples. Section III defines basic terms used to define our transition system and algorithm. Section IV describes TranSeq and SeqSolve. Section V gives proof sketches of correctness and termination; due to space limitations full definitions and proofs will appear in a full version of the paper. Section VI describes implementation considerations of our prototype and Section VII contains our evaluation. Conclusions and future work appear in Section VIII.

#### II. ILLUSTRATIVE EXAMPLES

In this section, we highlight some of the techniques used in our string equation solver, via a collection of examples, where a, b, c . . . are constants (elements of Γ) and u, v, w, x, y and z are string variables (elements of X ).

Example 1 [ConstUnsat] Consider the string equation b = a. The constant b differs from the constant a so this equation is unsatisfiable. Our algorithm determines this by performing partial evaluation, which includes evaluating constant prefixes and suffixes of equations.

Example 2 [Trim] Consider xab = xbb. Our algorithm trims common prefixes and suffixes from both sides of the input equation to get a = b which is unsatisfiable by ConstUnsat.

Example 3 [Decompose] Consider xyazy = yxubyz. Prefixes xy and yx have provably equal lengths. So do the suffixes zy and yz. Therefore our algorithm decomposes the input equation into three equations: xy = yx, a = ub and zy = yz. Equation a = ub can be further decomposed into a = b and u = ε, which is unsatisfiable by ConstUnsat.

Example 4 [EqLength] Consider uvxayvu = vuyxuv. Decomposition generates the two *distinct* equations uv = vu and xay = yx. Notice that if an equation is satisfiable, then both sides have to have the same length and our algorithm generates the constraint l<sub>x</sub> + 1 + l<sub>y</sub> = l<sub>y</sub> + l<sub>x</sub> where l<sub>x</sub> and l<sub>y</sub> denote the lengths of x and y, respectively, which is unsatisfiable.
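The length reasoning of Example 4 can be sketched in a few lines of Python. This is an illustrative sketch only, not SeqSolve's implementation: it assumes one character per symbol, treats the letters u to z as string variables (the hypothetical `VARIABLES` set) and detects only the special case in which all length variables cancel, whereas the general constraint would be passed to the LIA backend.

```python
from collections import Counter

# Assumption: single-letter symbols; u..z are string variables, the rest constants.
VARIABLES = set("uvwxyz")

def length_expr(side):
    """Return (coeffs, const) such that len(side) = sum(coeffs[v] * l_v) + const."""
    coeffs = Counter(ch for ch in side if ch in VARIABLES)
    const = sum(1 for ch in side if ch not in VARIABLES)
    return coeffs, const

def length_constraint(lhs, rhs):
    """Normalize len(lhs) = len(rhs) to sum(coeffs[v] * l_v) = bound."""
    lc, lk = length_expr(lhs)
    rc, rk = length_expr(rhs)
    coeffs = {v: lc[v] - rc[v] for v in set(lc) | set(rc) if lc[v] != rc[v]}
    return coeffs, rk - lk

def trivially_unsat(lhs, rhs):
    """True when every length variable cancels but the constant parts differ."""
    coeffs, bound = length_constraint(lhs, rhs)
    return not coeffs and bound != 0

print(trivially_unsat("xay", "yx"))  # the equation xay = yx from Example 4
print(trivially_unsat("uv", "vu"))   # lengths agree, so nothing is concluded
```

For xay = yx the variables cancel and the residual constraint 0 = -1 has no solution, which mirrors the argument in the example.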

Example 5 [EqConsts] Consider ax = xb. If the equation is satisfiable, then both sides of the equation must have the same number of occurrences of each constant. To enforce this, our algorithm generates the constraint 1 + c<sup>x</sup><sub>a</sub> = c<sup>x</sup><sub>a</sub>, where c<sup>x</sup><sub>a</sub> is the number of a's in x, which is unsatisfiable.

Example 6 [VarElim] Consider the set of (implicitly conjoined) string equations {uv = vu, xa = ax, cy = x}. The last equation has the form of a definition and this allows our algorithm to eliminate x by applying the appropriate substitution to the set of equations, giving us {uv = vu, cya = acy}. Since cya = acy is unsatisfiable, so is the set.

Example 7 [VarSplit] Consider xxa = cyx. One side starts with the constant c so the other side must also start with c, which means x cannot be empty and must start with a c. Our algorithm detects this and adds the equation x = cx̂, where x̂ is a new string variable. After eliminating x and trimming, we wind up with the equation x̂cx̂a = ycx̂, which decomposes into x̂c = y and x̂a = cx̂. The EqConsts analysis (Example 5) infers that the second equation is unsatisfiable. Our algorithm also does this for suffixes.

Example 8 [VarSubst] Consider wuzwuza = cywuz. The equation is equi-satisfiable with xxa = cyx: we substitute a new string variable, x, for the sequence of string variables, wuz, thereby eliminating all occurrences of w, u and z from all string equations. The resulting equation is unsatisfiable by VarSplit (see Example 7).

Example 9 [Rewrite] Consider the set of (implicitly conjoined) string equations {zv = ba, xxazv = cyxba}. The first equality can be used to rewrite the second equality to xxazv = cyxzv which can be trimmed to xxa = cyx, which is unsatisfiable, as per Example 7.

Example 10 [LenSplit] Consider xbyu = caxzb. The length of the prefix xb is strictly less than the length of the prefix cax, which allows us to infer that yu = ŷzb for some new string variable ŷ ≠ ε. We can rewrite yu to ŷzb (see Example 9) and after trimming, we wind up with the equation xbŷ = cax, which is unsatisfiable (see Example 5).

Example 11 [EqWords] Consider xbcay = ycbax. Let W<sup>x</sup><sub>ca</sub> and W<sup>y</sup><sub>ca</sub> be the number of occurrences of the word ca in x and y respectively. If the equation is satisfiable, then both sides must have the same number of ca occurrences. To enforce this, our algorithm generates the constraint W<sup>x</sup><sub>ca</sub> + 1 + W<sup>y</sup><sub>ca</sub> = W<sup>y</sup><sub>ca</sub> + W<sup>x</sup><sub>ca</sub>, which is unsatisfiable. Consider bwbxacv = vbabxcw, which shows that counting words requires more care than what the above example suggests, *e.g.*, to count the occurrences of bc, we have to take into account whether c is a prefix of w, whether b is a suffix of x, whether x is empty, and so on. We use 0-1 indicator variables P<sup>w</sup><sub>c</sub>, S<sup>x</sup><sub>b</sub> and ε<sub>x</sub>, denoting the above conditions, respectively. Now, with just the ab occurrence analysis, we can use variable splitting on w (w ends in an a) and then on v (v ends in an a) to derive a contradiction.

Example 12 [SAT] None of the string solvers we tried are able to solve the string equation xcyczvycya = yacwazvbux. This equation is outside the scope of Kepler<sub>22</sub>, StrSolve, HAMPI and Sloth. Sloth, TRAU and S3P return unsat, which is wrong. Norn, Z3Str3 and CVC4 timed out after 1,000 seconds, which shows that existing tools are incomplete, in a practical sense. Our solver finds the assignment x = aba, y = ab, u = cabc and v, w, z = ε in a fraction of a second.
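The assignment reported in Example 12 is easy to check by direct substitution. The following Python snippet is a small sanity check, assuming one character per symbol and using a hypothetical helper `substitute`; it is not part of SeqSolve.

```python
def substitute(word, assignment):
    """Replace every variable in word by its value; constants map to themselves."""
    return "".join(assignment.get(ch, ch) for ch in word)

# Assignment from Example 12; the empty string plays the role of epsilon.
assignment = {"x": "aba", "y": "ab", "u": "cabc", "v": "", "w": "", "z": ""}

lhs = substitute("xcyczvycya", assignment)
rhs = substitute("yacwazvbux", assignment)
print(lhs, rhs, lhs == rhs)
```

Both sides evaluate to the same string, abacabcabcaba, so the assignment is indeed a solution.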

#### III. BLOCKS, SUBSTITUTIONS AND THEORIES

Suppose that a sequence u has a length-l subsequence of consecutive occurrences of the constant a. This subsequence can be compactly represented by the pair (a, l), which we refer to as a *block*. Blocks are pairs in Γ × PExp where

$$\mathit{PExp} := \mathbb{P} \mid x \mid \mathit{PExp} + \mathit{PExp} \mid \mathit{PExp} - \mathit{PExp}$$

and x is a variable over positive natural numbers, ℙ. We require that a PExp is positive. A sequence that allows blocks is called an *extended sequence* (es); an *extended sequence equation* (ese) is similarly defined. The set of extended sequences es is (Γ ∪ (Γ × PExp) ∪ Z)*. We define a function compress : es → ((Γ × PExp) ∪ Z)* which, given an (extended) sequence, replaces contiguous occurrences of each constant by its block such that no two blocks of the same constant are adjacent to each other, thus returning a unique *maximally compressed sequence*. We define the following useful functions, which are given an extended sequence U: (1) Elems : es → 2<sup>Γ ∪ Z ∪ (Γ × PExp)</sup> returns the set of elements of U; (2) Atoms : es → 2<sup>Γ ∪ Z</sup> returns the set of variables and constants occurring in U; (3) Consts : es → 2<sup>Γ</sup> returns the set of constants in U; (4) Vars : es → 2<sup>Z</sup> returns the set of variables in U. These functions extend naturally to eses and to sets of ess and eses. An extended sequence U *represents* a sequence u if u is obtained from U by replacing every block (α, n) by α repeated n times. Note that n needs to be a positive integer. Extended sequences U and V are syntactically equivalent if they represent the same sequence. We use ≡ to denote syntactic equivalence. For example, (α, 2)αX ≡ α(α, 2)X, as both of them represent the sequence αααX. Notice that syntactic equivalence is an equivalence relation.
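As an illustration of compress, the following Python sketch handles the special case in which every block length is a concrete positive integer rather than a symbolic PExp. The representation (strings for constants and variables, pairs for blocks) and the `CONSTANTS` alphabet are assumptions made for this example only.

```python
from typing import List, Tuple, Union

Block = Tuple[str, int]      # (constant, repetition count), count >= 1
Elem = Union[str, Block]     # a constant or variable name, or a block

# Assumption: the alphabet Gamma of the running examples.
CONSTANTS = set("abc")

def compress(seq: List[Elem]) -> List[Elem]:
    """Merge adjacent occurrences of the same constant into a single block."""
    out: List[Elem] = []
    for e in seq:
        if isinstance(e, str) and e in CONSTANTS:
            e = (e, 1)                          # a bare constant is a block of length 1
        if (isinstance(e, tuple) and out
                and isinstance(out[-1], tuple) and out[-1][0] == e[0]):
            out[-1] = (e[0], out[-1][1] + e[1])  # extend the previous block
        else:
            out.append(e)                        # variable, or block of a new constant
    return out

print(compress([("a", 2), "a", "x"]))
print(compress(["a", ("a", 2), "x"]))
```

Both calls yield the same maximally compressed sequence, which matches the observation that (α, 2)αX and α(α, 2)X are syntactically equivalent.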

We define a substitution σ to be a partial function of the form σ : es ⇀ es. Given substitution σ, let σ<sub>v</sub> be σ restricted to Z and let σ<sub>s</sub> be σ \ σ<sub>v</sub>. Let dom(f) and cod(f) be the domain and codomain of function f, respectively. Note that dom(σ<sub>v</sub>) ⊆ Z, so σ<sub>v</sub> is a normal substitution. Substitutions σ<sub>v</sub> and σ<sub>s</sub> partition σ and have disjoint domains. We say that σ<sub>s</sub> is an *extended* substitution, as its domain may contain sequences. We require substitutions to be *well-typed*, *i.e.*, σ<sub>v</sub> must map unit variables to sequences of unit length. Uσ stands for the application of substitution σ to U ∈ es. This notation extends naturally to equations and sets of equations. In order for application to be well-defined, we require that σ is *consistent*, as defined below. We say that σ is *uniquely defined* if for all x, y ∈ dom(σ), if x ≠ y then Atoms(x) ∩ Atoms(y) = ∅. To see why we require this, consider the case where σ<sub>v</sub> = {x:ab, y:a} and σ<sub>s</sub> = {yax:aba}; note that (yax)σ is ambiguous.

Given two uniquely defined substitutions, σ and τ, we say that they are *equivalent*, written σ ≡ τ, if for all U ∈ es, we have Uσ ≡ Uτ. We say that σ is *consistent* if it is uniquely defined and ⟨∃τ :: dom(τ) ⊆ Z ∧ σ ≡ τ⟩, *i.e.*, σ is equivalent to a normal substitution. Consider σ = {xay:bbb}. Even though σ is uniquely defined, it cannot be expressed as a normal substitution. From now on, unless we say otherwise, all substitutions are implicitly assumed to be consistent. A substitution σ is said to solve an ese U = V if Uσ ≡ Vσ; σ solves Q, a set of eses, if σ solves every ese in Q. A *word* ab is an es in which no prefix is a suffix.

Theorem 1. *If* σ *is a consistent substitution and* x<sub>1</sub>, . . . , x<sub>n</sub> ∈ Z *are distinct variables such that* n ≥ 0 *and* {x<sub>1</sub>, . . . , x<sub>n</sub>} ∩ Vars(dom(σ)) = ∅*, then* σ ∪ {x<sub>1</sub>:V<sub>1</sub>, . . . , x<sub>n</sub>:V<sub>n</sub>} *(where* V<sub>1</sub>, . . . , V<sub>n</sub> *are extended sequences of the right type) is a consistent substitution.*

A theory is a pair T = (Σ, I), where Σ is a signature and I is a class of Σ-interpretations, the models of T. A set of formulas, Ψ, entails in T a Σ-formula φ, written Ψ ⊨<sub>T</sub> φ, if every interpretation in I that satisfies all formulas in Ψ satisfies φ as well. The set Ψ is unsatisfiable in T if Ψ ⊨<sub>T</sub> ⊥.

Let LIA be a theory with signature (0, 1, +, −, ≤) interpreted over the standard model of integers ℤ. A linear constraint is a formula of the form ∑<sub>i∈[1..n]</sub> a<sub>i</sub>x<sub>i</sub> ≤ b, where x<sub>i</sub> are variables and a<sub>i</sub> and b are integer constants. For a collection of linear constraints C, C ⊨<sub>LIA</sub> ⊥ means that C is unsatisfiable in LIA, whereas C ⊭<sub>LIA</sub> ⊥ means that a model exists for C. Our algorithm accepts and generates linear constraints on the conjunction of input string equations. It assumes a sound, complete and terminating backend ILP solver for such constraints. Let ES be a theory of (extended) sequences over a signature Σ<sub>ES</sub> with two sorts: extended sequences (es) and integers (ℤ) along with an infinite set of variables over each sort. Σ<sub>ES</sub> also includes constants in Γ, PExp expressions, blocks, (extended) sequences and functions len interpreted as the string length function, countConst interpreted as a function counting the number of a specified constant in a sequence and countWords interpreted as a function counting the number of specified words in a sequence.

#### IV. MPMT-BASED STRING SOLVER

Our algorithm, SeqSolve, accepts a conjunction of string equations Q as well as initial constraints Cinit and returns either unsat, unknown or sat along with a solution. Cinit is a set of initexp's defined as

$$\begin{aligned} \mathit{LExp} &:= \mathbb{Z} \mid x \mid \mathsf{len}(u) \mid \mathit{LExp} + \mathit{LExp} \mid \mathit{LExp} - \mathit{LExp} \\ \mathit{initexp} &:= \mathit{LExp} \ (< \mid \le \mid > \mid \ge \mid = \mid \neq) \ \mathit{LExp} \end{aligned}$$

where x is an integer variable (ℤ), u is an (extended) sequence and len : es → ℕ is a function that returns the length of u. We refer to variables occurring in PExp and LExp expressions as *numeric variables*. Central to the algorithm is a nondeterministic transition system TranSeq with rules that operate on configurations consisting of (extended) sequence equations and sets of LIA constraints.

Our decision procedure can be integrated into MPMT solvers in a fine-grained way since MPMT is based on branching, using the branch-and-cut framework. However, in order to make the paper more self contained, we present TranSeq and SeqSolve with as few dependencies on the MPMT framework as possible.

Our decision procedure can be integrated into SMT solvers using the idea of *recursive solvers*: these are solvers whose decision procedures may depend on the solvers themselves. For example, we can integrate our decision procedure into Z3, even though our decision procedure uses Z3 as a backend solver, by using a separate Z3 process to handle the LIA constraints and one can use this integration as a backend solver for yet another decision procedure, and so on. As far as we know, we are the first to propose the idea of recursive solvers. For SMT solvers like Z3 that provide contexts and a stack with a push-pop interface to manage constraints, integration can be achieved using these features by creating a new context or stack frame, thereby allowing decision procedures to query the SMT solver without polluting its state.

#### *A. Configurations*

The algorithm works on configurations that include tuples of the form ⟨unsat⟩, ⟨unknown⟩, ⟨sat, σ, C⟩ and ⟨Q, σ, vars, C⟩ where (1) Q is a set of eses, (2) σ : es ⇀ es is a (consistent) substitution, (3) vars is a superset of the variables in Z which occur in Q, (4) C is a union of constraints Clen, Cconsts, Cwords and a set of linear constraints corresponding to Cinit, where (i) Clen is a set of linear constraints regarding the lengths of variables in vars. For x ∈ vars, l<sub>x</sub> is an integer variable denoting the length of x and ε<sub>x</sub> is a 0-1 indicator variable indicating whether x is empty. Linear constraints in Clen and Cinit are over these integer variables and over PExp variables; (ii) Cconsts is a set of linear constraints regarding the number of occurrences of constants in variables from vars. For x ∈ vars, n<sup>x</sup><sub>a</sub> is an integer variable denoting the number of occurrences of the constant a in x. Linear constraints in Cconsts are over these variables as well as over variables of Clen; (iii) Cwords is a set of linear constraints regarding the number of words occurring in variables from vars. Let x ∈ vars and s ∈ consts*. Then W<sup>x</sup><sub>s</sub> denotes the number of s occurrences in x; P<sup>x</sup><sub>s</sub> and S<sup>x</sup><sub>s</sub> are 0-1 indicator variables indicating whether x begins with s and ends with s, respectively. Linear constraints in Cwords are over these variables as well as over variables of Clen.

The reason why we distinguish between Clen, Cconsts and Cwords is that it makes it easier to consider simplified transition systems that include only a subset of these kinds of constraints. We define sets consts and Cfuel where (1) consts is a superset of the constants from Γ occurring in Q and (2) Cfuel is a set of linear constraints over the l<sub>x</sub> variables, used to guarantee termination. Both consts and Cfuel are generated once and never modified by our transition system. The rules in TranSeq depend on auxiliary functions that are used to generate LIA constraints or to simplify equations. All of these functions are described in the full version of this paper.

#### *B. Transition System TranSeq*

We describe a non-deterministic transition system TranSeq. TranSeq consists of a set of rules called derivation rules. A derivation rule applies to a configuration K if all of the rule's premises are satisfied by K. Such a rule is *enabled* for K. A derivation tree is a tree where each node is a configuration and the children of any non-leaf node are exactly the configurations obtained by applying one of the derivation rules to the node. A configuration is *terminal* if no rules can be applied to it. We prove that terminal configurations are either of the form ⟨unsat⟩, in which case we call them *unsat* terminal nodes, ⟨unknown⟩, in which case we call them *unknown* terminal nodes, or of the form ⟨sat, σ, C⟩, in which case we call them *sat* terminal nodes and σ, C can be used to generate a satisfying assignment to the equations appearing in the root of the tree.

A configuration K = ⟨Q, σ, vars, C⟩ is sat (unsat) iff Q ∪ C ∪ Cfuel is sat (unsat). K is C-sat iff Q ∪ C is sat. Notice that an unknown terminal node may be sat (or unsat). This discrepancy is due to the Cfuel constraints, which are provable upper bounds on the lengths of minimal solutions, but only if we have no length constraints in the input, so it is possible that K is C-sat, but the configuration is unsat and we generate an unknown terminal node. The derivation rules of TranSeq are given in guarded assignment form and can be categorized into three groups: (1) Terminal rules: Rules that yield terminal nodes. (2) Inference rules: Rules that generate new inferences. (3) Branching rules: Rules that generate multiple subproblems.

A derivation tree is *closed* if all its leaf nodes are terminal nodes. A derivation tree is *unsat-closed* if it is closed and all of its leaf nodes are unsat-terminal nodes. A derivation tree is *unknown-closed* if it is closed, has at least one unknown terminal node and has no sat-terminal nodes. We prove that if a derivation tree is unsat-closed, then the conjunction of the equations and constraints appearing in the root of the tree is unsatisfiable. A derivation tree for a set of sequence equations Q = {u<sub>1</sub>=v<sub>1</sub>, u<sub>2</sub>=v<sub>2</sub>, . . . , u<sub>n</sub>=v<sub>n</sub>} and some initial length constraints Cinit (if provided) is a tree whose root, genRoot(Q, Cinit), is defined in Algorithm 1, where Choose(X) is a function that, given a non-empty set X, returns an element of X. Clen, Cconsts and Cwords are initialized with linear constraints by functions initLen, initConsts and initWordCount respectively. These functions generate constraints which are satisfiable for any string variable. Cfuel consists of constraints on the size of the minimum solution of each equation in Q, which are calculated in function initFuel and are based on results from [23]. The sets consts and vars are supersets of the constants and variables occurring in Q, respectively.

We define the function toLIA, which given an initexp returns a linear constraint. Given len(x), where x is a sequence variable, toLIA returns lx; we extend this to initexp expressions in the obvious way and use toLIA to also generate fuel constraints. We denote the set of words we are interested in counting as W, which is global.

#### *C. Rules in TranSeq*

We now describe each rule in TranSeq. The conclusion of a rule describes how each component of a configuration is changed, if at all. Rules with two or more conclusions separated by ∥ are branching rules, where each of the configurations is the starting configuration of a new branch in the derivation tree. In derivation rules, if Q is relevant, it appears on the top-left corner in the premise and as the last line of a concluding branch. A, t is an abbreviation for A ∪ {t} and A∼t abbreviates A \ {t}. We use ≡ (≢) for syntactic equivalence (in-equivalence) and = (≠) for semantic equality (inequality).

Algorithm 1 genRoot(Q, Cinit): Given an input set of string equations Q, genRoot generates the root node of a derivation tree.

$$\begin{array}{l} \sigma \leftarrow \{\} \\ vars \leftarrow \{x \mid x \in \mathcal{Z} \land x \in uv \land u{=}v \in Q\} \\ consts \leftarrow \{a \mid a \in uv \land a \in \Gamma \land u{=}v \in Q\} \\ \textbf{if}\ consts = \emptyset \land vars \cap \mathcal{Y} \neq \emptyset\ \textbf{then} \\ \quad consts \leftarrow \{\mathsf{Choose}(\Gamma)\} \\ C_{len} \leftarrow \bigcup_{v \in vars} \mathsf{initLen}(v) \\ C_{consts} \leftarrow \bigcup_{v \in vars} \mathsf{initConsts}(v, consts) \\ C_{words} \leftarrow \bigcup_{v \in vars,\, w \in W} \mathsf{initWordCount}(v, w) \\ \mathcal{C} \leftarrow \mathsf{toLIA}(\mathcal{C}_{init}) \cup C_{len} \cup C_{consts} \cup C_{words} \\ \mathcal{C}_{fuel} \leftarrow \mathsf{initFuel}(Q) \\ \textbf{return}\ \langle Q, \sigma, vars, \mathcal{C} \rangle \end{array}$$
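The first steps of genRoot, which collect vars and consts from the input equations, can be sketched in Python as follows. The sketch assumes one character per symbol and uses a hypothetical function name `collect_symbols`; the constraint initializers (initLen, initConsts, initWordCount, initFuel) are left out because their definitions appear only in the full version of the paper.

```python
def collect_symbols(equations, alphabet, unit_vars=frozenset()):
    """Return (vars, consts) for a list of equations given as (lhs, rhs) strings."""
    symbols = {s for u, v in equations for s in u + v}
    consts = symbols & alphabet
    variables = symbols - alphabet
    if not consts and variables & unit_vars:
        # Choose(Gamma): any element will do; min() just keeps the sketch deterministic.
        consts = {min(alphabet)}
    return variables, consts

gamma = frozenset("abc")
print(collect_symbols([("xa", "ax"), ("cy", "x")], gamma))
print(collect_symbols([("e", "f")], frozenset("ab"), unit_vars=frozenset("ef")))
```

The second call shows the special case in which no constant occurs in Q but a unit variable does, so one constant is picked from Γ.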

Terminal rules When Q is empty, if C is unsatisfiable, LIAUnsat infers unsat otherwise Sat returns a sat configuration.

$$\frac{\mathcal{C} \models_{\mathsf{LIA}} \bot}{\langle unsat \rangle}\ \mathsf{LIAUnsat} \qquad \frac{\{\}}{\langle sat, \sigma, \mathcal{C} \rangle}\ \mathsf{Sat}$$

If the fuel constraints are needed to show unsatisfiability, then the rule FuelUnsat returns unsat if no initial linear constraints were provided, otherwise the rule Unknown returns unknown. Terminal rules are subject to fairness constraints, as described later.

$$\frac{\mathcal{C}_{init} = \emptyset \qquad \mathcal{C} \cup \mathcal{C}_{fuel} \models_{\mathsf{LIA}} \bot}{\langle unsat \rangle}\ \mathsf{FuelUnsat} \qquad \frac{\mathcal{C}_{init} \neq \emptyset \qquad \mathcal{C} \not\models_{\mathsf{LIA}} \bot \qquad \mathcal{C} \cup \mathcal{C}_{fuel} \models_{\mathsf{LIA}} \bot}{\langle unknown \rangle}\ \mathsf{Unknown}$$
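The interplay of the four LIA-based terminal rules can be summarized by a small decision function. The Python sketch below is only an illustration: the function name, the boolean parameters (which stand in for calls to the backend solver) and the order in which the rules are tried are assumptions, since TranSeq itself is non-deterministic.

```python
from typing import Optional

def terminal_outcome(q_empty: bool, c_sat: bool, c_with_fuel_sat: bool,
                     c_init_empty: bool) -> Optional[str]:
    """Return the terminal verdict, or None if no LIA-based terminal rule applies.

    c_sat           -- whether C alone is satisfiable
    c_with_fuel_sat -- whether C together with C_fuel is satisfiable
    c_init_empty    -- whether no initial constraints were provided
    """
    if not c_sat:
        return "unsat"                                  # LIAUnsat
    if not c_with_fuel_sat:
        # Fuel bounds are only provable when there are no input constraints.
        return "unsat" if c_init_empty else "unknown"   # FuelUnsat / Unknown
    if q_empty:
        return "sat"                                    # Sat
    return None

print(terminal_outcome(q_empty=False, c_sat=True, c_with_fuel_sat=False, c_init_empty=False))
```

The printed case corresponds to the Unknown rule: C is satisfiable, the fuel constraints make it unsatisfiable, and initial constraints were supplied.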

If there exists an equation without variables whose two sides are syntactically different extended sequences, ConstUnsat infers unsat.

$$\frac{\{U = V, \dots\} \qquad U \not\equiv V \qquad \mathsf{Vars}(UV) = \emptyset}{\langle unsat \rangle}\ \mathsf{ConstUnsat}$$

Note that we do not apply substitution σ to U and V when checking for syntactic equivalence, as shown below.

$$\frac{\{U = V, \dots\} \qquad U\sigma \not\equiv V\sigma \qquad \mathsf{Vars}(UV) = \emptyset}{\langle unsat \rangle}\ \mathsf{ConstUnsat}$$

This is because, for any equation U=V ∈ Q, we get the original rule due to Uσ = U as a result of the invariant Qσ = Q, which we prove later.

When one side of an extended equation contains a constant or a block, while the other side is empty, ConstEmpty deduces unsat. If both sides begin with blocks of unequal constants, DiffConsts deduces unsat.

$$\frac{\{U = \epsilon, \dots\} \qquad \alpha \in \mathsf{Atoms}(U) \qquad \alpha \in consts}{\langle unsat \rangle}\ \mathsf{ConstEmpty} \qquad \frac{\{(\alpha, l)U = (\beta, m)V, \dots\} \qquad \alpha \neq \beta}{\langle unsat \rangle}\ \mathsf{DiffConsts}$$

If one side of an equation contains a unit variable while the other side is empty, then YVarEmpty infers ⟨unsat⟩.

$$\frac{\{U = \epsilon, \dots\} \qquad e \in U \qquad e \in \mathcal{Y}}{\langle unsat \rangle}\ \mathsf{YVarEmpty}$$

The rules ConstEmpty and DiffConsts deduce unsat based on how terms in an equation start, but there is a symmetry here that allows us to define rules that make the same deduction based on how terms end. For example, the symmetric version of DiffConsts would start with {U(α, l) = V (β, m), . . .}, but would otherwise be identical to DiffConsts. When rules have this kind of symmetry, we denote it by underlining the name of the rule in its definition. These symmetric rules help with efficiency, but are not needed for completeness, so to simplify the rest of the presentation, we proceed as if they do not exist.

Inference rules Trim removes syntactically equal prefixes and suffixes from both sides of an equation; note that one of a, b can be ε. EqElim removes eses whose two sides are syntactically equivalent. Observe that Trim can be used to reduce an equation U=V which is syntactically equivalent on both sides, to get ε = ε, in which case we get syntactic equivalence of both sides trivially.

$$\frac{\{aUb = cVd, \dots\} \qquad a \equiv c \qquad b \equiv d \qquad |ab| > 0}{\{U = V, \dots\}}\ \mathsf{Trim} \qquad \frac{\{U = V, \dots\} \qquad U \equiv V}{\{\dots\}}\ \mathsf{EqElim}$$
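For plain sequences, Trim amounts to stripping the longest common prefix and then the longest common suffix. The Python sketch below illustrates this under the assumption of one character per symbol and no blocks; `trim` is a hypothetical helper, not the authors' code.

```python
def trim(lhs, rhs):
    """Strip the longest common prefix, then the longest common suffix."""
    i = 0
    while i < min(len(lhs), len(rhs)) and lhs[i] == rhs[i]:
        i += 1
    lhs, rhs = lhs[i:], rhs[i:]
    j = 0
    while j < min(len(lhs), len(rhs)) and lhs[-1 - j] == rhs[-1 - j]:
        j += 1
    return lhs[:len(lhs) - j], rhs[:len(rhs) - j]

print(trim("xab", "xbb"))  # Example 2: what remains is the equation a = b
```

Stripping the prefix first and measuring the suffix on what is left ensures that no symbol is removed twice.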

Decompose splits an ese U=V into multiple equations using length constraints. A simple example is given in Example 3.

$$\frac{\{U = V, \dots\} \qquad |\mathsf{splitEq}(U, V, \mathcal{C})| > 1}{\mathsf{splitEq}(U, V, \mathcal{C}) \cup \{\dots\}}\ \mathsf{Decompose}$$
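A much simplified version of splitEq can be sketched in Python. This sketch is an assumption-laden stand-in: it ignores the constraint set C and regards two prefixes (or suffixes) as having provably equal length only when they contain the same variables with the same multiplicities and the same number of constants. The names `split_eq`, `length_key` and `VARIABLES` are hypothetical.

```python
from collections import Counter

# Assumption: single-letter symbols; u..z are string variables, the rest constants.
VARIABLES = set("uvwxyz")

def length_key(s):
    """Symbolic length of s: variable multiplicities plus the number of constants."""
    variables = Counter(ch for ch in s if ch in VARIABLES)
    consts = sum(1 for ch in s if ch not in VARIABLES)
    return frozenset(variables.items()), consts

def split_eq(lhs, rhs):
    """Split lhs = rhs at prefixes/suffixes whose symbolic lengths coincide."""
    for i in range(1, len(lhs)):               # shortest matching proper prefixes
        for j in range(1, len(rhs)):
            if length_key(lhs[:i]) == length_key(rhs[:j]):
                return [(lhs[:i], rhs[:j])] + split_eq(lhs[i:], rhs[j:])
    for i in range(len(lhs) - 1, 0, -1):       # shortest matching proper suffixes
        for j in range(len(rhs) - 1, 0, -1):
            if length_key(lhs[i:]) == length_key(rhs[j:]):
                return split_eq(lhs[:i], rhs[:j]) + [(lhs[i:], rhs[j:])]
    return [(lhs, rhs)]

print(split_eq("xyazy", "yxubyz"))  # the equation from Example 3
```

On the equation of Example 3 the sketch returns the three equations xy = yx, a = ub and zy = yz, so Decompose would be enabled because the result has more than one element.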

Compress converts both sides of an equation u=v ∈ Q into maximally compressed sequences. Observe that the premise requires that there is at least one constant element in u=v. Note that blocks such as (a, 1) are not constants, as they are not elements of Γ.

$$\frac{\{u = v, \dots\} \qquad \mathsf{Elems}(uv) \cap \Gamma \neq \emptyset}{\{\mathsf{compress}(u) = \mathsf{compress}(v), \dots\}}\ \mathsf{Compress}$$

VarSubst formalizes the idea from Example 8. Given W, a non-empty subsequence in Q satisfying the conditions below, the rule replaces W with a new variable z. We show later that for every node in a derivation tree generated by our algorithm, Qσ = Q holds; hence, the first condition for consistency of substitutions is satisfied. The second consistency condition is satisfied due to the premise that requires atoms of W and Q{W:z} to be disjoint. Hence, the substitution in the new configuration is consistent. The LIANewVar procedure generates numeric constraints for new variables. After this rule, it is called implicitly whenever a new variable is introduced.

$$\frac{\begin{array}{c} \{U = V, \dots\} \qquad \langle \exists S, T :: SWT = U \land |W| > 1 \rangle \\ \mathsf{Atoms}(W) \subseteq vars \qquad z \in \mathcal{X} \qquad z \notin vars \\ \mathsf{Atoms}(W) \cap \mathsf{Atoms}(\{U = V, \dots\}\{W : z\}) = \emptyset \end{array}}{\begin{array}{c} \mathsf{LIANewVar}(z) \\ \sigma \leftarrow \sigma, W : z \\ \{U = V, \dots\}\{W : z\} \end{array}}\ \mathsf{VarSubst}$$

Rewrite replaces a subsequence S of U by T, given that S=T is an equation in Q. Rewrite can choose which occurrences to replace. Infinite derivation trees are ruled out with a fairness requirement that only allows us to use the Rewrite rule a finite number of times.

$$\frac{\{U = V, S = T, \dots\} \quad S \in U}{\{U \{S \colon T\} = V, S = T, \dots\}} \text{ \textit{Rewrite}}$$

EqLength, EqConsts and EqWords generate length, constant count and word count constraints implied by an equation. Function equateWordCount returns a linear constraint equating the number of occurrences of a word w in U and V .

$$\begin{array}{c} \dfrac{\{U = V, \dots\} \qquad \mathsf{equateLen}(U, V) \not\subseteq \mathcal{C}}{C_{len} \leftarrow C_{len} \cup \mathsf{equateLen}(U, V)}\ \mathsf{EqLength} \\[2ex] \dfrac{\{U = V, \dots\} \qquad \mathsf{equateConsts}(U, V, consts) \not\subseteq \mathcal{C}}{C_{consts} \leftarrow C_{consts} \cup \mathsf{equateConsts}(U, V, consts)}\ \mathsf{EqConsts} \\[2ex] \dfrac{\{U = V, \dots\} \qquad w \in consts^{\geq 2} \qquad \mathsf{equateWordCount}(U, V, w) \not\subseteq \mathcal{C}}{C_{words} \leftarrow C_{words} \cup \mathsf{equateWordCount}(U, V, w)}\ \mathsf{EqWords} \end{array}$$

VarElim allows us to eliminate variables.

$$\frac{\{x = V, \dots\} \qquad x \notin V \qquad x \in \mathcal{X}}{\sigma \leftarrow \sigma, x : V}\ \mathsf{VarElim}$$
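The effect of VarElim on a set of equations, as used in Example 6, can be sketched in Python. The sketch assumes one character per symbol, so that `str.replace` implements substitution, and the function name `var_elim` is hypothetical. How the authors maintain σ internally is not specified in this section; the sketch simply records the new binding.

```python
def var_elim(equations, sigma, var, value):
    """Eliminate var using the definition var = value (requires var not in value)."""
    if var in value:
        raise ValueError("occurs check failed: variable appears in its own definition")
    # Drop the defining equation itself, in either orientation.
    rest = [eq for eq in equations if eq not in ((var, value), (value, var))]
    new_eqs = [(u.replace(var, value), v.replace(var, value)) for u, v in rest]
    new_sigma = dict(sigma)
    new_sigma[var] = value          # record the binding x : V
    return new_eqs, new_sigma

eqs = [("uv", "vu"), ("xa", "ax"), ("cy", "x")]
print(var_elim(eqs, {}, "x", "cy"))
```

Applied to the set from Example 6, the sketch yields the equations uv = vu and cya = acy together with the binding of x to cy.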

Given an equation where one side starts with c occurrences of variable x and the other starts with m occurrences of constant β, the rule VarSplit infers shape information about x involving fresh variable y. x cannot be empty, and the prefix of x<sup>c</sup> must be syntactically equivalent to (β, m). Hence, VarSplit infers that x is (β, k)y, where c ∗ k ≥ m. Note that c is a constant, hence expressions such as c ∗ k do not take us out of the LIA fragment. Also note that if k < m, y will have to start with β as well, which we do not want. Hence we add an implication that if k < m then y is empty. We extend the set of equations with x=(β, k)y. Anytime we extend the set of equations with an equation of the form x= . . ., we call VarElim to eliminate the variable x.

$$\frac{\begin{array}{c} \{x^c(\alpha, l)U = (\beta, m)V, \dots\} \qquad \alpha \neq \beta \qquad c > 0 \\ x, y \in \mathcal{X} \qquad y \notin vars \end{array}}{\begin{array}{c} C_{len} \leftarrow C_{len},\ k > 0,\ (c - 1) \ast k < m \leq c \ast k,\ k < m \Rightarrow \epsilon_y = 1 \\ C_{words} \leftarrow C_{words},\ k < m \Rightarrow S^x_\beta = 1 \\ \{x = (\beta, k)y,\ x^c(\alpha, l)U = (\beta, m)V, \dots\} \end{array}}\ \mathsf{VarSplit}$$

Length constraints alone may not always be enough to split an equation. LenSplit introduces a new variable on one side of an equation such that the resulting equation is clearly split into smaller and possibly more tractable equations. Example 10 illustrates a simple example.

$$\frac{\{UW = SzV, \dots\} \qquad \mathsf{len}(U) < \mathsf{len}(Sz) \qquad y, z \in \mathcal{X} \qquad y \notin vars}{\begin{array}{c} C_{len} \leftarrow C_{len},\ \epsilon_y = 0 \\ \{Uy = Sz,\ W = yV, \dots\} \end{array}}\ \mathsf{LenSplit}$$

Inferences made by the backend LIA solver can be used to infer facts about sequence variables. LIAEmpty concludes that a variable x is empty if ε<sub>x</sub> = 1 is derived by the solver. Similarly, x starts (ends) with α iff the solver derives P<sup>x</sup><sub>α</sub> = 1 (S<sup>x</sup><sub>α</sub> = 1).

$$\frac{\mathcal{C} \models_{\mathsf{LIA}} \epsilon_x = 1 \qquad x \in vars}{\{x = \epsilon, \dots\}}\ \mathsf{LIAEmpty} \qquad \frac{\mathcal{C} \models_{\mathsf{LIA}} P^x_\alpha = 1 \qquad x \in vars \qquad y \in \mathcal{X} \qquad y \notin vars}{\{x = \alpha y, \dots\}} \qquad \frac{\mathcal{C} \models_{\mathsf{LIA}} S^x_\alpha = 1 \qquad x \in vars \qquad y \in \mathcal{X} \qquad y \notin vars}{\{x = y\alpha, \dots\}}\ \mathsf{LIAEnd}$$

Given an equation where one side is empty, XVarEmpty infers that any variable x ∈ X on the other side must also be empty. If the two sides of an ese start with distinct unit variables x and y, then DiffYVars infers that the two variables must be equal.

$$\begin{array}{c} \{U = \epsilon, \dots\} \quad x \in U \quad x \in \mathcal{X} \\ \hline \{x = \epsilon,\ U = \epsilon, \dots\} \end{array} \ \text{XVarEmpty} \qquad \begin{array}{c} \{xU = yV, \dots\} \quad x \neq y \quad x, y \in \mathcal{Y} \\ \hline \{x = y,\ U = V, \dots\} \end{array} \ \text{DiffYVars}$$

*Branching rules:* Given an equation where one side starts with a block of α while the other side starts with a unit variable e, UnitConst infers that the length of the α block is either equal to one or greater than one. Observe that some constraints in this rule are emphasized with a wavy underline. If such a constraint is implied by C, we can jump directly to its corresponding branch. In practice, this avoids branching whenever one of the underlined constraints can be derived in the premise.

$$\begin{array}{c} \{eU = (\alpha, l)V, \dots\} \quad e \in \mathcal{Y} \\ \hline \begin{array}{ccc} C\_{len} \leftarrow C\_{len},\ \underline{l = 1} & \parallel & C\_{len} \leftarrow C\_{len},\ \underline{l > 1} \\ \{e = \alpha,\ U = V, \dots\} & & \{e = \alpha,\ U = (\alpha, l - 1)V, \dots\} \end{array} \end{array} \ \text{UnitConst}$$

Given an equation where one side starts with a unit variable e while the other side starts with sequence variable y, UnitVar infers that either y is empty, or e is a prefix of y.

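$$\begin{array}{c} \{eU = yV, \dots\} \quad e \in \mathcal{Y} \quad y, z \in \mathcal{X} \quad z \notin vars \\ \hline \begin{array}{ccc} C\_{len} \leftarrow C\_{len},\ \underline{\epsilon\_{y} = 1} & \parallel & C\_{len} \leftarrow C\_{len},\ \underline{\epsilon\_{y} = 0} \\ \{y = \epsilon,\ eU = V, \dots\} & & \{y = ez,\ U = zV, \dots\} \end{array} \end{array} \ \text{UnitVar}$$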

If both sides of an equation start with blocks of the same constant α, SimConsts infers that either both blocks have the same length or one of them is longer than the other. So this rule should have three branches, one equating l and m, and the other two deducing a strict inequality between them. However, it is written with two branches, one equating l and m, and the other deducing $\hat{m} > \hat{l}$. This is because, for the sake of conciseness, we introduce "hatted" variables $\hat{U}$, $\hat{V}$, $\hat{l}$, $\hat{m}$ and $\hat{\beta}$. A branch with hatted variables signifies the presence of another branch where the hatted variables are replaced by their substitutions, defined as:

$$\{\hat{x}; y, \hat{y}; x, \hat{X}; Y, \hat{Y}; X, \hat{U}; V, \hat{V}; U, \hat{l}; m, \hat{m}; l, \hat{\alpha}; \beta, \hat{\beta}; \alpha\}$$

Notice that we also have underlined constraints in the conclusion. So, the rule SimConsts represents six rules: three after expanding hatted variables, where none of the underlined constraints is implied by C, and three more, each covering the case where one of the underlined constraints is present in the premise of its corresponding rule.

$$\begin{array}{c} \{(\alpha, l)U = (\alpha, m)V, \dots\} \\ \hline \begin{array}{ccc} C\_{len} \leftarrow C\_{len},\ \underline{m = l} & \parallel & C\_{len} \leftarrow C\_{len},\ \underline{\hat{m} > \hat{l}} \\ \{U = V, \dots\} & & \{\hat{U} = (\alpha, \hat{m} - \hat{l})\hat{V}, \dots\} \end{array} \end{array} \ \text{SimConsts}$$

Similar to SimConsts, DiffXVars also uses both hatted variables and underlined constraints which gives rise to a total of ten rules. If both sides of an equation start with syntactically different variables x, y ∈ X , and none of the underlined constraints is implied by C, then DiffXVars infers that either one of them is empty or they are semantically equal or one of them is a prefix of the other.

$$\begin{array}{c} \{xU = yV, \dots\} \quad x \neq y \quad x, y \in \mathcal{X} \quad z \in \mathcal{X} \quad z \notin vars \\ \hline \begin{array}{ccccc} C\_{len} \leftarrow C\_{len},\ l\_{\hat{x}} > l\_{\hat{y}}, & \parallel & C\_{len} \leftarrow C\_{len},\ l\_{x} = l\_{y}, & \parallel & C\_{len} \leftarrow C\_{len},\ \epsilon\_{\hat{x}} = 1 \\ \epsilon\_{\hat{x}} = \epsilon\_{\hat{y}} = \epsilon\_{z} = 0,\ l\_{\hat{x}} = l\_{\hat{y}} + l\_{z} & & \epsilon\_{x} = \epsilon\_{y} = 0 & & \\ \{\hat{x} = \hat{y}z,\ z\hat{U} = \hat{V}, \dots\} & & \{x = y,\ U = V, \dots\} & & \{\hat{x} = \epsilon,\ \hat{U} = \hat{y}\hat{V}, \dots\} \end{array} \end{array} \ \text{DiffXVars}$$

Finally, VarConst fires when one side of an equation starts with a constant block (α, l) while the other side starts with a variable x. Again, VarConst represents eight rules due to the presence of underlined constraints in its branching conclusions. Assuming none of these constraints is implied by C, the first branch sets x to empty; the second branch sets the length of x to less than l; the third branch equates x to (α, l); and the last branch sets x to a block of α whose length is greater than l, possibly followed by another variable y that does not start with α.

$$\begin{array}{c} \{xU = (\alpha, l)V, \dots\} \quad x, y \in \mathcal{X} \quad y \notin vars \\ \hline \begin{array}{ccc} C\_{len} \leftarrow C\_{len},\ \epsilon\_x = 1 & \parallel & C\_{len} \leftarrow C\_{len},\ 0 < l\_x < l \\ \{x = \epsilon,\ U = (\alpha, l)V, \dots\} & & \{x = (\alpha, l\_x),\ U = (\alpha, l - l\_x)V, \dots\} \\ \parallel \quad C\_{len} \leftarrow C\_{len},\ 0 < l\_x = l & \parallel & C\_{len} \leftarrow C\_{len},\ 0 < l < l\_x \\ \{x = (\alpha, l),\ U = V, \dots\} & & \{x = (\alpha, l\_x)y,\ xU = (\alpha, l)V, \dots\} \end{array} \end{array} \ \text{VarConst}$$

### *D. SeqSolve definition*

We define SeqSolve in Algorithm 2. It takes a set of sequence equations W and an optional set of length constraints Cinit as input and returns either sat with a solution, unknown, or unsat.

#### V. CORRECTNESS OF SEQSOLVE

Full proofs of correctness of SeqSolve appear in the full version of this paper. In the interest of brevity, we outline the structure of proofs in this section. First, we define correctness.

Algorithm 2 SeqSolve takes a set of (extended) sequence equations W and optionally a set of linear constraints Cinit as input and returns either sat with a solution, unknown, or unsat.


Definition 1. *A* string equation solver *is an algorithm that takes as input a set of string equations and a set of linear constraints. Its output is either "Unsat," "Unknown," or "Sat" and an assignment.*

Definition 2. *A string equation solver is* sound *if it never lies, by which we mean: (1) when it returns "Sat," the conjunction of the string equations and the linear constraints is satisfiable and the assignment returned is a satisfying assignment and (2) when it returns "Unsat," the conjunction of the string equations and the linear constraints is unsatisfiable.*

Definition 3. *A string equation solver is* partially correct *if it is sound and terminating.*

Definition 4. *A string equation solver is* fully correct *if it is sound, terminating and never returns "Unknown."*

Note that a sound solver can be turned into a partially correct solver by adding a timeout, which results in the solver returning "Unknown." We prove that our solver is fully correct for the theory of string equations by showing that when the input consists of only a conjunction of string equations Q, our transition system generates a derivation tree that is unsat-closed iff the input is unsatisfiable; otherwise it generates a derivation tree containing a sat terminal node, from which we can extract a satisfying assignment for the input. When the input also includes linear constraints, our solver is partially correct as it may also generate an unknown-closed derivation tree. We show that SeqSolve is sound using the following theorems.

Theorem 5. *Given inputs* Q, Cinit *such that* SeqSolve *generates a tree* T *with a* sat *terminal node* ⟨sat, σ, C⟩*, then* σ, C *can be used to generate a solution for* Q, Cinit*.*

A configuration is *var-compliant* iff it is of the form ⟨Q, σ, vars, ...⟩ where Vars(σ) ⊆ vars (by Vars(σ) we mean Vars(dom(σ)) ∪ Vars(cod(σ))). A configuration is *numvar-compliant* iff (1) it is of the form ⟨Q, σ, vars, C⟩ and all numeric variables appearing in it are also in C and (2) for a variable x ∈ vars, initLen(x) ∪ initConsts(x, consts) ∪ initWordCount(x, consts) ⊆ C. A configuration is *good* iff it is either terminal or it is disjoint, var-compliant and numvar-compliant. A derivation tree is *good* if all of its nodes are good configurations. It turns out that all SeqSolve-generated derivation trees are good.

Lemma 7. *Given input* Q, Cinit *where* Q *is a set of (extended) sequence equations and* Cinit *is a set of linear constraints,* genRoot *returns a good, non-terminal configuration.*

Lemma 12. TranSeq *rules preserve goodness,* i.e.*, when applied to a good configuration, they produce good configurations.*

SeqSolve is subject to the following fairness conditions: (1) LIAUnsat, FuelUnsat and Unknown are *weakly-fair* rules. First note that once any of these rules is enabled, it stays enabled. We require that no branch of a derivation tree contains a suffix in which a weakly-fair rule is infinitely enabled, yet never applied. (2) Rewrite can only be applied a finite number of times along any branch.

A *fair* derivation tree is one which respects the above fairness conditions. SeqSolve generates fair and good derivation trees. We use good derivation trees to show that TranSeq is sound.

Theorem 6. *Every* TranSeq *rule is sound when applied to a good configuration.*

The termination of SeqSolve (and TranSeq) depends on a bound on the minimum lengths of solutions of string equations as described in [23] and on fair derivation trees.

Theorem 9. SeqSolve *is terminating.*

Theorem 10. SeqSolve *is a partially correct string equation solver.*

Theorem 11. SeqSolve *is a fully correct string equation solver when the input does not include any linear constraints.*

# VI. IMPLEMENTATION OF SEQSOLVE

Our implementation of SeqSolve, along with all the benchmarks used, is publicly available [24]. SeqSolve is implemented in ACL2s [25], which allows us to (1) define datatypes like blocks, sequences and valid Z3 expressions (used to query Z3); (2) define TranSeq rules, which requires proving termination and input/output contracts (input/output types); (3) prove basic theorems relating datatypes (subtypes, etc.) and properties needed for the above proofs; and (4) make essential use of the Z3 interface ACL2s provides to solve ILP constraints. SeqSolve provides various settings that can be used to control how aggressively it generates linear constraints; however, all of the results reported in this paper are with the default settings. We implemented SeqSolve as a standalone decision procedure as opposed to making it a part of an MPMT solver. This makes it easier to compare our tool with other string solvers in an apples-to-apples way, avoiding the complications that would arise from the use of different underlying solvers and frameworks.

We apply a few TranSeq rules until we reach a fixpoint before generating the derivation tree in order to simplify the input problem. These preprocessing steps include Decompose, VarElim, VarSubst and Compress. After reaching a fixpoint, we use LIAUnsat to check if the set of initial constraints and the linear constraints we generated above are unsat.

In our implementation of the rule EqWords, we only use words w with the property that no non-empty prefix of w is a suffix of w. Since our solver makes many low-level calls to Z3, it does this in an incremental way. In addition, care is taken to avoid unnecessary calls to Z3, *e.g.*, LIAUnsat is not checked after running Trim, EqElim, Decompose, Compress, VarSubst, Rewrite and VarElim, because none of these rules updates C. We do not apply any branching rules unless we have no other options. Our implementation supports string operations like charAt, contains, indexOf, substr, prefixOf and suffixOf. Each of these operations can be converted to a problem in the theory of extended sequences. For example, given the charAt constraint e = (str.at s n), we convert it into the conjunction of the string equation s = xey and len(x) = n, where e ∈ Y and x, y ∈ X. Given the constraint (str.contains s t), we convert it into the string equation s = xty where x, y ∈ X.
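The two reductions just described can be checked by brute force on small concrete strings. The following is a minimal Python sketch, not the ACL2s implementation: the helper names `char_at_witnesses` and `contains_witnesses` are hypothetical, strings stand in for sequences, and only in-bounds positions n are covered, since the encoding requires a unit e.

```python
def char_at_witnesses(s, n):
    """All (x, e, y) with s == x + e + y, len(e) == 1 and len(x) == n."""
    if 0 <= n < len(s):
        return [(s[:n], s[n], s[n + 1:])]
    return []  # no unit e exists at position n


def contains_witnesses(s, t):
    """All (x, y) with s == x + t + y, i.e. the solutions of s = x t y."""
    return [(s[:i], s[i + len(t):])
            for i in range(len(s) - len(t) + 1)
            if s[i:i + len(t)] == t]
```

A constraint such as (str.contains s t) holds for concrete s and t exactly when `contains_witnesses(s, t)` is non-empty, which mirrors the satisfiability of s = xty.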

### VII. EVALUATION

We compared our solver against Z3Str2 and Z3Str3 (Z3 version 4.8.8), Norn 1.0, Z3-Trau, Sloth 1.0 and CVC4 1.7. These are the only string solvers we know of that solve string equations with length constraints and ran without crashing. In [26], the tools CVC4, Z3Str2 and S3 are evaluated; there, S3 is found to be 5 times slower than Z3Str2 and to crash on about 4.5% of problems in the Kaluza [27] benchmarks. We ran all of the selected tools on Kaluza and Stringfuzz-generated [28] benchmarks, as well as on benchmarks consisting of problem instances pertinent to type inference in Remora [9], [10], a dependently typed array processing language. The type of an array term in Remora encodes the shape of the array as a list of dimensions (natural numbers). Our work was motivated by the problem of inferring these shapes, which reduces to solving string equations. For example, suppose that X has dimensions [a 3]b and Y has dimensions b[3]z, where a is a single dimension, while b and z are lists of dimensions, and juxtaposition indicates concatenation. If X and Y are used in a context where they must have the same dimensions, then for the program to be well-typed, we require that the string equation a3b = b3z is satisfiable. One solution is b = [ ], z = [3] and a = 3, in which case X and Y are 2-dimensional matrices with shape [3 3].
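The shape equation a3b = b3z from the example can be explored by exhaustive search over a small domain. The sketch below is only an illustration in Python. The function name `solutions`, the dimension domain and the length bound are assumptions made here, and the search has nothing to do with how SeqSolve works.

```python
from itertools import product


def solutions(dims=(1, 2, 3), max_len=2):
    """Enumerate (a, b, z) with [a, 3] + b == b + [3] + z over a small domain.

    a ranges over single dimensions; b and z range over lists of
    dimensions of length at most max_len.
    """
    lists = [list(p) for n in range(max_len + 1)
             for p in product(dims, repeat=n)]
    return [(a, b, z) for a in dims for b in lists for z in lists
            if [a, 3] + b == b + [3] + z]


# The solution named in the text: a = 3, b = [], z = [3].
assert (3, [], [3]) in solutions()
```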

We used all of the problems in the above mentioned benchmarks that were in the extended sequence theory, thus, excluding problems in Kaluza that used other constructs. This allows us to evaluate only our contribution, the string solver, not the underlying solvers. In total, we have 1,178 problems, of which 903 are sat problems and 275 are unsat problems. We cross-verified the tools and for all benchmark problems, all tools that gave definitive answers agreed on the classification of the problem. All experiments were performed on the same machine, which was running macOS Catalina 10.14.6 with a 2.7GHz Intel Core i5 CPU and 8 GB of memory. The timeout for each problem was set to 60 seconds.

Fig. 1. Performance of SeqSolve, CVC4, Z3-Str3, Norn, Sloth, Trau and Z3-Str2 on solved benchmarks across all three benchmark sets.

Figure 1 shows the results of the performance evaluation, using what we call a *ray plot*. Ray plots are designed to visually depict the results of the evaluation in as simple a way as possible. On the x-axis we have the expected number of problems solved and on the y-axis we have the expected time in seconds. Suppose you want to determine how long it will take to solve n benchmark problems, say 800; just look at the line x = 800 and you will see that SeqSolve will take about 100 seconds, CVC4 will take over 2,000 seconds, Z3Str3 will take just under 12,000 seconds, Norn will take about 5,500 seconds and Z3Str2 can only solve about 500 problems, so it will never solve 800 problems. Symmetrically, if you want to determine how many problems you can expect to solve in t seconds, just look at the line y = t. This is a simpler plot than a cactus plot, which shows similar information, but with problems ordered, on a per-tool basis, from easiest to hardest. These orderings can vary significantly from tool to tool and there is no way for a user of the tool to determine how easy or difficult a problem will be, so it is not clear what benefit there is to this extra complexity. It is easy to generate ray plots; just run all the benchmark problems and draw a ray from the origin to the (p, t) coordinate, where p is the number of problems solved and t is the time taken. This is equivalent to shuffling the problems many times and taking the average of the running times for the shufflings.
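The claimed equivalence between a ray and the average over shufflings can be checked on a tiny example. The following Python sketch is an illustration only. The function names `ray_time` and `shuffled_time` and the per-problem times are hypothetical, and the average is taken exactly over all orderings rather than over random shuffles.

```python
from itertools import permutations
from statistics import mean


def ray_time(n, solved, total_time):
    """Expected time for n problems on the ray from (0, 0) to (solved, total_time)."""
    if not 0 <= n <= solved:
        raise ValueError("n lies beyond the end of the ray")
    return n * total_time / solved


def shuffled_time(n, times):
    """Average time to finish the first n problems over every ordering of times."""
    return mean(sum(order[:n]) for order in permutations(times))


# Hypothetical per-problem running times in seconds, for illustration only.
times = [1.0, 2.0, 6.0]
for n in range(len(times) + 1):
    assert abs(ray_time(n, len(times), sum(times)) - shuffled_time(n, times)) < 1e-9
```

Each problem is equally likely to appear among the first n positions of a random ordering, so the expected prefix time is n times the mean running time, which is exactly the point on the ray at x = n.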

Table I gives a tabular version of the experimental evaluation. Tuples under "Solved" give the number of problems solved for the Stringfuzz-generated, Kaluza and handcrafted benchmarks, respectively. In addition to the time in seconds, we also show the number of problems for which solvers returned unknown, timed out or returned an incorrect result (X). We ran the tools without giving them a timeout and our scripts killed jobs that were taking too long, but some tools returned unknown before timeouts occurred. Notice that SeqSolve beats all the other string solvers in terms of the standard ordering, which is based first on the number of incorrect results, then on the number of problems solved and finally on the time taken.

TABLE I PERFORMANCE OF SOLVERS ON ALL BENCHMARKS

*Acknowledgements:* We thank Andrew Walter for integrating Z3 with ACL2s, which was indispensable.

#### VIII. CONCLUSION AND FUTURE WORK

We introduced a new non-deterministic, branching transition system, TranSeq, for deciding the satisfiability of conjunctions of string equations and length constraints. TranSeq extends the MPMT framework for combining decision procedures and we prove that it is both sound and complete. We implemented a prototype, SeqSolve, which is based on TranSeq and resolves non-deterministic choices in a way designed to infer as much as possible with as few computational resources as possible. We evaluated SeqSolve by comparing it with existing tools on a suite of benchmark problems and found that SeqSolve solved more problems and was faster than existing solvers. In our ongoing work, we plan to extend the scope of TranSeq so that it supports richer classes of constraints. We also plan to reason about the implementation, as it is mostly written in ACL2s, which is built on top of the ACL2 theorem prover.

#### REFERENCES


# Lookahead in Partitioning SMT

Antti E. J. Hyvärinen USI, Switzerland *antti.hyvaerinen@usi.ch* Matteo Marescotti Facebook, UK *mmatteo@fb.com*

Natasha Sharygina USI, Switzerland *natasha.sharygina@usi.ch*

*Abstract*—Lookahead in propositional satisfiability has proven efficient as a heuristic in pre- and in-processing, for partitioning instances for parallel solving, and as the main driver of a standalone solver. While applying similar techniques in satisfiability modulo theories is potentially equally useful, adapting lookahead to learning theory clauses and to estimating search space sizes in the presence of first-order structures is not straightforward. This paper addresses both of these observations. We give a hybrid algorithm that integrates lookahead into the state-based representation of an SMT solver and show that in the vast majority of cases it is possible to compute full lookahead up to depth four on inexpensive theories. We also show the role of first-order structures in SMT search space: while in most of our benchmarks the partitions are easier to solve than the original instance, we identify cases where lookahead results in sequences of increasingly difficult instances for a computationally expensive theory.

# I. INTRODUCTION

Large scale parallel SMT solving that would result in linear speed-up reliably over any instance in a cloud environment is a lucrative prize that has been intensively studied over the recent years [26], [14], [13], [17]. A central sub-goal in this project is in understanding how to apply successfully the *cube-and-conquer* [24] approach in SMT solving. The lookahead heuristic in propositional logic [27], in addition to being efficient in solving certain types of structured problems [8], has recently proven to be a powerful tool in constructing partitions for divide-and-conquer-based parallel SAT solvers [10], [9]. The idea is to base the search-space traversal on the explicit principle of branching on literals that reduce maximally the remaining search space. In addition to SAT solvers, the heuristic has been implemented in SMT solvers such as Z3 [20], where it serves for in- and pre-processing, and by us in OpenSMT [11], [12] as an alternative implementation for the main SAT solver.

This paper studies how the literals chosen by the lookahead algorithm for SMT affect the difficulty of the instance from the perspective of a standard CDCL-based SMT solver. This question is central to divide-and-conquer-style parallel SMT solving, where the lookahead heuristic is used to build a binary *lookahead tree* of depth d, with nodes labeled by the literals chosen with the lookahead heuristic, and root labeled with the true literal ⊤. Conjoining the literals in each rooted path to the leaves with the original instance produces 2<sup>d−1</sup> partitioned instances that do not share models. The resulting instances can be solved in parallel, and the original instance is satisfiable if and only if one of the partitioned instances is satisfiable.
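The way root-to-leaf paths of a lookahead tree give rise to 2<sup>d−1</sup> model-disjoint partitions can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the algorithm of this paper: `partition_cubes` and the callback `pick_atom` are hypothetical names, `pick_atom` merely stands in for the lookahead heuristic, and literals are encoded as nonzero integers where -v is the negation of v.

```python
def partition_cubes(pick_atom, depth):
    """Return the cubes (lists of literals) on all root-to-leaf paths.

    The root carries the true literal and contributes nothing to a cube;
    every further level splits on an atom chosen by pick_atom(cube).
    """
    cubes = [[]]                       # the root: empty cube, i.e. "true"
    for _ in range(depth - 1):
        next_level = []
        for cube in cubes:
            atom = pick_atom(cube)     # stand-in for the lookahead heuristic
            next_level.append(cube + [atom])
            next_level.append(cube + [-atom])
        cubes = next_level
    return cubes


# Toy chooser: split on atom 1 first, then atom 2, and so on.
cubes = partition_cubes(lambda cube: len(cube) + 1, depth=3)
assert len(cubes) == 2 ** (3 - 1)
```

Any two distinct cubes differ in the polarity of the atom at which their paths diverge, which is why the resulting instances cannot share a model.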

Our main contributions are rigorously defining what we mean by lookahead heuristic for an SMT solver, and an experimental study on how the use of this heuristic affects the difficulty of the partitions. In defining the heuristic, we show that lookahead can be integrated tightly into a CDCL(T) style algorithm that fully leverages learned clauses, including determining unsatisfiability while constructing partitions. We summarize our experimental results as follows. First, in many cases the heuristic runs in seconds when producing a nontrivial number of partitions (say, 16). This is already a nontrivial observation given that the full lookahead heuristic in SAT is known to be in most cases prohibitively expensive. Second, usually the approach results in partitions that are easier to solve than the original. While this result seems rather implicit and obvious, it is made interesting by the next observation: There are instances where the above described lookahead-based parallel algorithm's run time *increases* compared to the original instance even when no overhead from partitioning or communication is considered, and the number of partitions is in the thousands. We show some details on the latter cases that help to understand the underlying phenomena, and identify a possible reason arising from the way the theory solving algorithm for linear real arithmetic is implemented in most SMT solvers. These cases serve to illustrate the complexity of the ultimate goal of an efficient and general parallel solver.

Combining a lookahead algorithm with a CDCL-based SMT solver in a meaningful way is not straightforward. First, the lookahead heuristic assumes that the clauses of an instance are known at computing time. In contrast, an SMT solver produces a new clause whenever a propositional model is inconsistent in the theory. A potentially very large number of clauses remain invisible for the heuristic. Second, the explanation clauses guide the search through non-chronological backtracking. This means that the heuristic scores of variables change with each backtrack, and the algorithm may determine unsatisfiable entire sub-trees of the lookahead tree. The sub-trees need to be re-computed to ensure that the approach produces 2<sup>d</sup> partitions. Finally, it is not clear how the SMT solver's theory-specific reasoning part interacts with the lookahead heuristic that only measures the reduction in the propositional space.

To the best of our knowledge, this paper is the first to build lookahead partitioning into the SMT framework in a way that observes the search space reduction resulting from learned clauses, and guarantees the unit-propagation consistency of the resulting partitions in case instance satisfiability is not determined. We consider the theories of uninterpreted functions with equality [3] and linear real arithmetic [4]. These are the two central algorithms that constitute, together with a SAT solver, the core of most SMT solvers. Combinations of these two theories with pre-processing techniques are capable of handling the quantifier-free subset of the SMT-LIB benchmark library instances. The algorithm either produces exactly 2<sup>d−1</sup> instances none of which can be shown unsatisfiable through (theory-aware) unit-propagation in the current state of the SMT solver; or shows the original instance either satisfiable or unsatisfiable. The partitioning algorithm compromises in certain cases the exactness of the lookahead scores for decreased run time. We believe that the efficiency of our proof-of-concept implementation forms a solid basis for future research in this direction. Since the approach also sheds light on the observed slowdowns, we believe that the work will prove useful for designing more general parallelization algorithms for SMT.

The paper is organized as follows. After discussing related work, in Sec. III we define our SMT-related logical notation. In Sec. IV we adapt the rule-based description of SMT from [25] to the specific case of lookahead and introduce a running example. In Sec. V we present our lookahead partitioning algorithm, then provide experimental results in Sec. VI, and conclude in Sec. VII.<sup>1</sup>

#### II. RELATED WORK

The lookahead heuristic was first introduced in the context of DPLL-based SAT solving in [27]. The original idea uses the number of propagated literals as a measure of search space reduction [23], and is further extended to consider, e.g., equivalence reasoning [5], the clause-based Jeroslow-Wang heuristic [16], and approaches for choosing which variables to consider for lookahead [7].

Lookahead as a pre- and in-processor for clause-learning SAT solvers was formalized in [6]. However, it was not integrated into the CDCL algorithm in the sense that is done in this work. A similar pre- and in-processing approach was recently implemented for the SMT solver Z3 [20]. When used as a pre- and in-processor for an ordinary, CDCL-based solver, the lookahead implementation can be conceptually fairly straightforward. Lookahead is not directly involved in the CDCL search, and therefore the artifacts related to non-chronological backtracking need not be necessarily considered. In [12] we formalized an algorithm inspired by the lookahead heuristic for solving quantifier-free first-order formulas based on CDCL SMT solving. The approach is implemented in our SMT solver OpenSMT [11] and was shown experimentally to be efficient for solving linear integer arithmetic problems with Boolean structure. Compared to the publication, in the current work we give a more formal treatment of the implementation, define the lookahead algorithm for partitioning, and provide experimental data and analysis for parallel solving based on cube-and-conquer.

<sup>1</sup>An extended version of the paper, available at https://usi-verification-and-security.github.io/opensmt-doc/publications/lookahead-in-partitioning-smt-extended.pdf, provides an appendix detailing some of the optimizations we implemented for the lookahead approach, further experiments, and a comparison to an alternate scoring for the lookahead algorithm.

Our focus is on how SMT lookahead can implement partitioning in divide-and-conquer for parallel solving. The idea was introduced for parallel SAT solving in [10], and an implementation for parallel SMT solving was used in [13], [17]. However, the details of this partitioning approach have not been discussed before. The lookahead-based partitioning implementation in [10] applies essentially lookahead-based binary partitioning recursively. The downside of this design is that it does not use the full information in the CDCL solver, and producing the partitions might miss an unsatisfiability high up in the tree. As a result it constructs partitions that are known to be unsatisfiable in an intermediate state of the partitioning algorithm.

The substantial amount of research in SAT heuristics, overviewed in [1] from the perspective of parallel solving, provides a promising foundation for partitioning in SMT. Recent relevant approaches include [15], where the authors recognize high-level information that can be used for better clause learning.

#### III. PRELIMINARIES

The *Satisfiability Modulo Theories* (SMT) problem [22], [3] consists of determining whether a propositional formula is satisfiable, given that some of the atoms have an interpretation in first-order logic. A *conflict-driven clause learning* (CDCL) SMT solver searches first for propositional models, which are then checked for consistency with respect to the theory. If found inconsistent, the propositional structure is enriched with an *explanation*, that is, a clause containing in general theory atoms. If instead during the process the propositional part becomes unsatisfiable, the solver has shown the whole formula unsatisfiable. The formula is satisfiable if the solver finds a theory-consistent model.

*1) SMT solving:* This section fixes the notation for first-order logic and SMT. We define sets of function symbols, terms, constants, and predicate symbols as usual, the last containing the special symbols ⊤, ⊥, and = that represent, respectively, true, false, and equality. We call applications of predicate symbols on terms *atoms*. Let U be a possibly infinite set of elements containing at least the truth values true and false. A *model* M assigns to each constant a unique element from U, to each function symbol of arity n ≥ 1 a total function U<sup>n</sup> → U, to each predicate symbol of arity zero a truth value true or false, and to each predicate symbol of arity n ≥ 1 a total function U<sup>n</sup> → {true, false}. An *interpretation* A is the extension of M to general terms in the usual sense.

Given a finite set of atoms At, a *clause* is a set of *literals*, that is, positive and negative atoms x, ¬x, x ∈ At. We extend the negation to clauses, and write ¬(l<sub>1</sub> ∨ ... ∨ l<sub>n</sub>) for ¬l<sub>1</sub> ∧ ... ∧ ¬l<sub>n</sub>. A *propositional formula in conjunctive normal form* (CNF) is a conjunction of clauses. Throughout the text we use both a set of literals and disjunction, and a set of clauses and a conjunction, interchangeably. We also treat conjunctions of unit clauses (*cubes*) as sets of literals when this cannot be confused with a disjunction. A sequence of literals is written l<sub>1</sub> ... l<sub>n</sub>, and when the order plays no role, we equate the sequence with the corresponding set {l<sub>1</sub>, ..., l<sub>n</sub>}.

A set of literals X is *consistent* if for no x both x ∈ X and ¬x ∈ X. A consistent set σ is called an assignment. An assignment is *total* if for all atoms x ∈ At either x ∈ σ or ¬x ∈ σ. An atom x is *assigned* if either x ∈ σ or ¬x ∈ σ. The assignment σ satisfies a clause c when σ ∩ c ≠ ∅, and a formula ϕ if it satisfies all clauses of ϕ. A *theory* T is a non-empty set of models. A CNF formula ϕ is T*-satisfiable* if (i) there exists a satisfying total assignment σ for ϕ and an interpretation A that is an extension of a model M ∈ T, and (ii) for each l ∈ σ, l<sup>A</sup> ≡ true if l is of the form x; and l<sup>A</sup> ≡ false if l is of the form ¬x, where x is an atom of ϕ. In particular, given a formula ϕ and an assignment σ that is total (with respect to ϕ), we write σ ⊨<sub>T</sub> ϕ if σ is such an assignment. In addition we write ϕ′ ⊨<sub>p</sub> ϕ if all assignments that satisfy ϕ′ also satisfy ϕ propositionally, and ⊨<sub>T</sub> c if c is entailed by the theory, that is, a *theory lemma* of a theory T. For a formula, clause, literal, or assignment ξ we denote by Ats(ξ) the set of atoms appearing in ξ.
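The purely propositional part of these definitions translates directly into set operations. The Python sketch below is an illustration only. The function names are hypothetical, literals are encoded as nonzero integers with -v denoting ¬v, and the theory side (interpretations and T-satisfiability) is deliberately left out.

```python
def is_consistent(lits):
    """A set of literals is consistent if it never holds both x and its negation."""
    return all(-l not in lits for l in lits)


def is_total(sigma, atoms):
    """An assignment is total if every atom is assigned one way or the other."""
    return all(a in sigma or -a in sigma for a in atoms)


def satisfies_clause(sigma, clause):
    """sigma satisfies clause c when the intersection of sigma and c is non-empty."""
    return bool(set(sigma) & set(clause))


def satisfies_formula(sigma, cnf):
    """sigma satisfies a CNF formula if it satisfies all of its clauses."""
    return all(satisfies_clause(sigma, c) for c in cnf)
```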

In this work we study two theories: the theory of linear real arithmetic (LRA) and the theory of uninterpreted functions with equality (EUF). The universe of LRA consists of real numbers, function symbols ∗ and + of arity two restricted to expressing linear terms, and the predicate symbol ≤; all three have their usual interpretations. The EUF theory places no restrictions on the interpretations of constants, functions, or predicates (apart from the inherent ones for equality, ⊤, and ⊥).

*2) Parallel SMT solving:* Given an SMT instance ϕ, *partitioning* produces instances ϕ<sub>1</sub>, ..., ϕ<sub>k</sub> such that the satisfiability of ϕ is equal to the satisfiability of the disjunction ϕ<sub>1</sub> ∨ ... ∨ ϕ<sub>k</sub>. In addition, we are interested in partitionings such that no two partitions ϕ<sub>i</sub>, ϕ<sub>j</sub>, i ≠ j, share a total satisfying assignment. The *partitioning approach* Part(k) consists of solving an SMT instance ϕ by first constructing the partitions ϕ<sub>1</sub>, ..., ϕ<sub>k</sub>, and then solving each resulting partition ϕ<sub>i</sub> in parallel until one of them is shown satisfiable, or all of them are shown unsatisfiable.
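The control flow of Part(k) can be sketched with a thread pool. The Python code below is a toy under explicit assumptions: `brute_force_sat` is a hypothetical propositional stand-in for an SMT solver, partitions are given as cubes conjoined with the instance, literals are nonzero integers with -v denoting ¬v, and a real implementation would cancel the remaining jobs once one partition is shown satisfiable.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product


def brute_force_sat(cnf, cube, atoms):
    """Toy stand-in for an SMT solver: propositional brute force under a cube."""
    for signs in product((1, -1), repeat=len(atoms)):
        sigma = {s * a for s, a in zip(signs, atoms)}
        if set(cube) <= sigma and all(sigma & set(c) for c in cnf):
            return sorted(sigma, key=abs)
    return None


def part(cnf, cubes, atoms):
    """Solve cnf conjoined with each cube in parallel; sat if any partition is sat."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(brute_force_sat, cnf, cube, atoms)
                   for cube in cubes]
        for fut in as_completed(futures):
            model = fut.result()
            if model is not None:
                return "sat", model
    return "unsat", None      # every partition was shown unsatisfiable
```

When several partitions are satisfiable, which model is returned depends on scheduling, so callers should rely only on the status and on the returned model satisfying the instance.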

#### IV. CONFLICT-DRIVEN CLAUSE-LEARNING LOOKAHEAD IN SMT

The *CDCL lookahead algorithm* intuitively guides an SMT solver in a binary tree, using the solver's state to determine how to expand the tree. To describe the algorithm more precisely, we adapt here the rule-based presentation of CDCL(T) from [25], [21] to our needs. As usual, in the first phase an input SMT formula is converted into an equisatisfiable propositional formula ϕ in CNF while preserving the atoms in the theories T. The *state* ⟨σ | F⟩ of an SMT solver consists of σ, an initially empty assignment, and F, a set of clauses initially consisting of ϕ. The execution of the solver proceeds according to a set of rules described below. In general, the algorithm alternates between *propagation*, choosing a *decision literal*, denoted by x<sup>δ</sup>, and analysing conflicts found in propagation. The labels L and E refer to *learned* and *explanation* clauses. When they appear on the left side of ⟶, the corresponding rule matches only clauses that have the label.


A CDCL(T)-based SMT solver works by applying the above rules with two restrictions. (i) The solver always computes the *unit propagation closure* before deciding a new literal, i.e. the rule *Dec* is never applied if the rule *Prop* is applicable; and (ii) to notice any theory inconsistencies when a propositional assignment is found, if the rule *Dec* cannot be applied (i.e., all atoms are assigned) the solver applies the rule *TProp*. The solver always terminates if both the rules *Reset* and *Forget* are applied with an increasing interval [2].

Since the unit-propagation closure has a central role in computing lookahead, we give here two useful, related definitions in the above notation. Given a solver state ⟨σ | ϕ⟩, the *unit propagation closure* UP(σ, ϕ) is the set of literals σ′ ⊇ σ, where ⟨σ′ | ϕ⟩ is the state obtained by applying the rules *Prop* and *TProp* until neither one applies. A solver state ⟨σ | ϕ⟩ is called *unit propagation consistent* or *consistent* if the set UP(σ, ϕ) is consistent.
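A minimal Python sketch of the closure computation follows. It is a simplification made for illustration: only the propositional rule *Prop* is implemented, theory propagation (*TProp*) is left out, and a conflict is reported through a Boolean flag rather than through an inconsistent literal set. The names and the literal encoding are assumptions; the clause list `F` encodes clauses (1)–(5) of Example 1.

```python
from typing import Iterable, List, Set, Tuple

Literal = Tuple[str, bool]  # (atom, polarity); ("x", False) stands for ¬x


def negate(lit: Literal) -> Literal:
    return (lit[0], not lit[1])


def unit_propagation_closure(
    sigma: Iterable[Literal], clauses: List[Set[Literal]]
) -> Tuple[Set[Literal], bool]:
    """Apply the propositional rule Prop until it no longer applies.

    Returns (closure, ok). ok is False when some clause has every literal
    falsified, which this sketch treats as an inconsistent state.
    Theory propagation (TProp) is deliberately not modelled.
    """
    closure = set(sigma)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(lit in closure for lit in clause):
                continue  # clause already satisfied
            free = [lit for lit in clause if negate(lit) not in closure]
            if not free:
                return closure, False  # conflict: all literals are false
            if len(free) == 1:
                closure.add(free[0])  # unit clause forces its last literal
                changed = True
    return closure, True


# Clauses (1)-(5) of Example 1, with atoms written as plain strings.
F = [
    {("x", False), ("b<=c", True)},
    {("x", False), ("a<=b", True)},
    {("a<=d", False), ("a<=b", False), ("a<=c", False)},
    {("c<=d", True), ("b<=c", False), ("a<=d", True)},
    {("c<=d", True), ("a<=d", False), ("a<=c", True)},
]
```

Starting from {x}, the sketch returns {x, b ≤ c, a ≤ b} without conflict. Adding the decision ¬(c ≤ d) yields a conflict, since clauses (3), (4), and (5) cannot all be satisfied.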

The following running example illustrates the use of the rules. The notation *Prop*<sup>∗</sup> indicates a sequence of propagations.

*Example 1:* Consider the conjunction F = ( ¬x ∨ (b ≤ c) )(1) ∧ ( ¬x ∨ (a ≤ b) )(2) ∧ ( ¬(a ≤ d) ∨ ¬(a ≤ b) ∨ ¬(a ≤ c) )(3) ∧ ( (c ≤ d) ∨ ¬(b ≤ c) ∨ (a ≤ d) )(4) ∧ ( (c ≤ d) ∨ ¬(a ≤ d) ∨ (a ≤ c) )(5) where the numbers in parentheses label the clauses. The following is a possible computation of the CDCL(T) system.

$$\begin{array}{l} \langle \emptyset \mid F \rangle \xrightarrow{Dec} \langle x^{\delta} \mid F \rangle \xrightarrow{Prop^{*}} \langle x^{\delta} (b \le c)(a \le b) \mid F \rangle \xrightarrow{Dec} \\ \langle x^{\delta} (b \le c)(a \le b) \neg (c \le d)^{\delta} \mid F \rangle \xrightarrow{Prop^{*}} \\ \langle x^{\delta} (b \le c)(a \le b) \neg (c \le d)^{\delta} (a \le d) \neg (a \le c) \mid F \rangle \xrightarrow{PExp} \\ \langle x^{\delta} (b \le c)(a \le b) \neg (c \le d)^{\delta} (a \le d) \neg (a \le c) \mid F \wedge \left( (c \le d) \vee \neg (b \le c) \vee (a \le c) \right)^{E} \rangle \xrightarrow{BJ} \\ \langle x^{\delta} (b \le c)(a \le b) \mid F \wedge C_{1}^{L} \rangle \end{array}$$

where the learned clause, obtained by resolution, is C<sub>1</sub><sup>L</sup> := ( (c ≤ d) ∨ ¬(b ≤ c) ∨ ¬(a ≤ b) )<sup>L</sup>. Continuing the example, we get

$$\xrightarrow{TProp} \langle x^{\delta}(b \le c)(a \le b)(c \le d)(a \le c) \mid F' \rangle.$$

where F′ := F ∧ C<sub>1</sub><sup>L</sup> ∧ ( ¬(a ≤ b) ∨ ¬(b ≤ c) ∨ (a ≤ c) )<sup>L</sup>, the last being a valid clause in the theory, and

$$\begin{array}{l} \xrightarrow{Prop^{*}} \langle x^{\delta} (b \le c)(a \le b)(c \le d)(a \le c) \neg (a \le d) \mid F' \rangle \\ \xrightarrow{TExp} \langle x^{\delta} (b \le c)(a \le b)(c \le d)(a \le c) \neg (a \le d) \mid F' \wedge \left( \neg (a \le c) \vee \neg (c \le d) \vee (a \le d) \right)^{E} \rangle \\ \xrightarrow{BJ} \langle \neg x \mid F' \wedge \neg x^{L} \rangle \end{array}$$

where ¬x<sup>L</sup> is obtained through a resolution derivation on clauses in F′ and the explanation.

#### V. LOOKAHEAD-BASED PARTITIONING FOR SMT

This section describes the lookahead-based algorithm for partitioning an SMT instance into 2<sup>d</sup> partitions or determining whether the instance is satisfiable.

#### *A. The Lookahead Score*

Lookahead in a backtracking search consists in general of repeated trial and backtracking on all available branches at a certain point of the search, and committing to the one that seems most promising. We define the relation between SMT solver states before and after the trial branch, and the lookahead score as the difference between the two. The approach is oblivious to the details of how the lookahead score between two states s and s′ is defined. Our implementation supports two scoring functions, one based on the number of free atoms in the instance globally [23], and the other on unassigned atoms in the clauses of the instance [8]. Our examples and experiments in this paper use the former.

Lookahead aims to assign with the rule *Dec* the literal that minimizes the upper bound for the remaining search space. Given a state s where neither *Prop* nor *TProp* applies, we define the *lookahead step* on a literal l as the sequence of rules starting from s, having *Dec* on l as the first rule, followed by unit propagation closure computation resulting in the state s′, and finally an *Undo* on l ending in state s. This sequence is not always possible, and we describe in Sec. V-B how we handle the failed cases. For a consistent state ⟨σ | ϕ⟩, the set UP(σ, ϕ) is unique. Therefore we can define the lookahead score of a literal l based on the difference between ⟨UP(σ, ϕ) | ϕ⟩ and ⟨UP(σ ∪ {l}, ϕ) | ϕ⟩. We denote the *lookahead score* of literal l by score(l) = |UP(σ ∪ {l}, ϕ) \ UP(σ, ϕ)|, that is, the number of propagated literals after deciding l, and extend the definition to atoms x as

$$score(x) = \min\left(score(x), score(\neg x)\right),\tag{1}$$

which minimizes the sum of the upper bounds for the remaining search spaces [23].<sup>2</sup>
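The score can be sketched in Python as follows. This is a propositional-only illustration, not the OpenSMT implementation: theory propagation is omitted, a failed lookahead step is reported as `None` instead of triggering a restart, and the decided literal itself is not counted, so that a literal that propagates nothing scores zero. All names are hypothetical.

```python
from typing import Iterable, List, Optional, Set, Tuple

Literal = Tuple[str, bool]  # (atom, polarity); ("x", False) stands for ¬x


def negate(lit: Literal) -> Literal:
    return (lit[0], not lit[1])


def up(sigma: Iterable[Literal], clauses: List[Set[Literal]]) -> Optional[Set[Literal]]:
    """Propositional unit propagation closure; None signals a conflict."""
    closure = set(sigma)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if clause & closure:
                continue  # clause already satisfied
            free = [lit for lit in clause if negate(lit) not in closure]
            if not free:
                return None
            if len(free) == 1:
                closure.add(free[0])
                changed = True
    return closure


def score_literal(sigma: Set[Literal], clauses: List[Set[Literal]], lit: Literal) -> Optional[int]:
    """Number of literals propagated by deciding lit (lit itself excluded)."""
    before = up(sigma, clauses)
    after = up(set(sigma) | {lit}, clauses)
    if before is None or after is None:
        return None  # the lookahead step failed
    return len(after - before - {lit})


def score_atom(sigma: Set[Literal], clauses: List[Set[Literal]], atom: str) -> Optional[int]:
    """Eq. (1): the minimum of the scores of the two polarities."""
    pos = score_literal(sigma, clauses, (atom, True))
    neg = score_literal(sigma, clauses, (atom, False))
    if pos is None or neg is None:
        return None
    return min(pos, neg)


# Clauses (1)-(5) of Example 1, with atoms written as plain strings.
F = [
    {("x", False), ("b<=c", True)},
    {("x", False), ("a<=b", True)},
    {("a<=d", False), ("a<=b", False), ("a<=c", False)},
    {("c<=d", True), ("b<=c", False), ("a<=d", True)},
    {("c<=d", True), ("a<=d", False), ("a<=c", True)},
]
```

On the formula of Example 1 with an empty assignment, the literal x scores 2 while ¬x scores 0, so the atom x scores 0 under Eq. (1).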

#### *B. Lookahead-Based Partitioning*


The approach is presented in Alg. 1. The algorithm constructs a tree with nodes labelled with literals. The tree is constructed depth-first using the stack, with the help of a CDCL(T) SMT solver s. The intuition is that the tree is built by guiding the SMT solver along the rooted paths, and the lookahead heuristic is used to expand a leaf node. The algorithm limits the search depth to the input value d, and is also a sound but incomplete (if |Ats(ϕ)| > d) SMT solver.

<sup>2</sup>There are other definitions for lookahead score, but they all favor atoms that minimize the remaining search space on both polarities [8].

Let n<sup>i</sup> denote a node n at depth i in the tree. Then each path in the tree from the root n<sup>0</sup> to a leaf n<sup>i</sup> corresponds to a partition as follows. We label each node n with a literal Lab(n), and the root n<sup>0</sup> is labelled Lab(n<sup>0</sup>) = ⊤. A path n<sup>0</sup> … n<sup>i</sup> is interpreted as a cube, and a path n<sup>0</sup> … n<sup>d</sup> in the tree corresponds to the partition ϕ ∧ Lab(n<sup>0</sup>) ∧ … ∧ Lab(n<sup>d</sup>).
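The correspondence between rooted paths and cubes can be illustrated with a short Python sketch. The `Node` class and the function `cubes` are hypothetical names; the root label ⊤ is represented by `None` and omitted from the cubes, since conjoining ⊤ does not change a partition.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A lookahead-tree node labelled with a literal (None stands for ⊤)."""
    label: Optional[str]
    children: List["Node"] = field(default_factory=list)


def cubes(node: Node, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return one cube per root-to-leaf path, in depth-first order."""
    path = prefix if node.label is None else prefix + (node.label,)
    if not node.children:
        return [path]
    result: List[Tuple[str, ...]] = []
    for child in node.children:
        result.extend(cubes(child, path))
    return result
```

Each returned cube, conjoined with ϕ, gives one partition. A tree whose root has the children ¬x and x, where ¬x is further split on a ≤ d, yields the three cubes (¬x, a ≤ d), (¬x, ¬(a ≤ d)), and (x).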

The main work, done in the loop on lines 6–22, consists of two phases: *setting the solver* s *to a given node* on Line 8, and *expanding the lookahead tree* on Line 14. We describe both phases, referring to the rules in Sec. IV.

*1) Expanding the lookahead tree:* The lookahead tree is expanded with new nodes c, c′ by the function expandTree on Line 14. Using the solver s the function computes the lookahead step for each literal x, ¬x not assigned in σ as described in Sec. V-A. The process may be interrupted by three special conditions:

	- If *BJ* becomes applicable with l<sup>δ</sup> = x or l<sup>δ</sup> = ¬x, the function does a *local restart*: it forgets the computed lookahead scores and restarts the lookahead computation.
	- If *BJ* is applicable with l<sup>δ</sup> = y or l<sup>δ</sup> = ¬y for some earlier decision literal y ≠ x, the function does a *complete restart* by returning BackJump.

If expandTree determines satisfiability, the algorithm terminates and reports the result immediately. The distinction between local and complete restarts is motivated by efficiency and has deep implications for the algorithm. We discuss this point in Sec. V-B3.

*2) Setting the solver to a given node:* A lookahead path obtained from the stack is used to set the solver s to the correct state where the lookahead scores of literals can be computed. This is done in Line 8 by the call to the function setSolverToNode that takes as arguments the solver s = ⟨σ | F⟩, and the current node n = n<sup>k</sup>. The function initially applies the rule *Reset* on the solver, and computes the unit propagation closure at the root by σ = UP(∅, F). Then, for each node n<sup>i</sup> on the path n<sup>0</sup> … n<sup>k</sup>, the function applies *Dec* with l = Lab(n<sup>i</sup>), and sets σ = UP(σ ∪ {l}, F). The process may be interrupted in two cases:


Otherwise, setting the solver to the node succeeds and the algorithm proceeds with expanding the tree.

To clarify the behavior of the algorithm, we show its execution on the running example (Example 1).

*Example 2:* Let ϕ = F from Ex. 1 and d = 2 for Alg. 1. The algorithm advances to line 14 to compute the lookahead scores of the variables using solver s. No conflicts are detected by s, literal x propagates {b ≤ c, a ≤ b}, and literals ¬(b ≤ c) and ¬(a ≤ b) propagate {¬x}. No other branch results in propagations. Hence the score from Eq. (1) is zero for all atoms.

Say the algorithm expands the tree, which up to now consisted only of the empty root, with nodes labeled ¬x and x, and pushes both nodes to the DFS stack. Assume that the algorithm first branches on ¬x. None of the free literals propagate, and the tree is expanded for example with ¬(a ≤ d) and a ≤ d. Once these are popped from the stack, the tree would consist so far of the branches ( ¬x (a ≤ d) ), ( ¬x ¬(a ≤ d) ), and (x).

The algorithm will now pop x on line 7. On line 14, during the execution of the lookahead heuristic, the algorithm will do the lookahead step on b ≤ c. This triggers the conflict-handling sequence shown in Ex. 1, resulting in the solver state ⟨¬x | F ∧ ( (c ≤ d) ∨ ¬(b ≤ c) ∨ ¬(a ≤ b) )<sup>L</sup> ∧ ( ¬(a ≤ b) ∨ ¬(b ≤ c) ∨ (a ≤ c) )<sup>L</sup>⟩. The backjump is on the earlier decision literal a ≤ c, not on the most recent decision literal b ≤ c (see the description above for expandTree), and therefore expandTree will return BackJump, restarting the tree construction.

The algorithm now builds the tree as it did the first time, but when computing lookahead in state ⟨x (b ≤ c)(a ≤ b)(c ≤ d)(a ≤ c) ¬(a ≤ d) | F′⟩ there are no free variables, and the algorithm reports satisfiability.

*3) Observations on the backjumps:* The backjump during the above execution is critical for the partition quality. It is relatively easy to see that applying recursively a lookahead algorithm on the original problem, as in [10], produces partitions that in a later state of the solver would not be unit-propagation consistent.

One could imagine a version of the algorithm that backtracks to the level indicated by the backjump, similar to the underlying SMT solver. This choice would intuitively result in less repeated work as the previously built lookahead tree would be preserved, and therefore conceivably in a more efficient algorithm. However, there are two reasons why the restart is necessary. First, a clause c learned in a backjump at expandTree on node n<sup>i</sup> alters the lookahead scores in an unpredictable way in the solver states closer to the root. The current lookahead tree becomes in general invalid from the heuristic perspective. Without the restart, the clause should be considered in all previous invocations of expandTree at least in the nodes n<sup>0</sup> … n<sup>i−1</sup>, and tracking such propagations would be expensive. Second, allowing backjumps in the lookahead tree means that when setting the solver to a new node (Line 8), a learned clause can cause a conflict not present when the node was pushed (lines 20 and 21). In this case it is unclear how the algorithm should proceed to construct the balanced binary tree with consistent partitions.

The distinction between local and complete restarts stems from the above two observations. Complete restarts are too expensive to be performed on every conflict, a relatively common event during the lookahead computation. Instead, they are done only on the long backjumps that are rare in lookahead-based branching. The consequence of having the local restarts is that setSolverToNode may result in a conflict. While this introduces a performance overhead, it turns out to be very rare and therefore insignificant in practice.<sup>3</sup>

We still recompute the lookahead scores in a local restart, since the error caused by omitting this may grow very large. The following example shows that not recomputing the lookahead after a conflict can miscalculate a literal's score with the maximum possible error.

*Example 3:* Consider the following derivation, where a lookahead at ⟨σ | G⟩ on x<sup>δ</sup> fails with the learned clause (c ∨ ¬x)<sup>L</sup>:

$$\begin{array}{l} \langle \sigma x^{\delta} \mid G \rangle \xrightarrow{PExp} \langle \sigma x^{\delta} \mid G \wedge c'^{E} \rangle \xrightarrow{BJ} \langle \sigma \neg x \mid G \wedge (c \vee \neg x)^{L} \rangle \\ \xrightarrow{Prop} \langle \sigma \neg x \sigma' \mid G \wedge (c \vee \neg x)^{L} \rangle . \end{array}$$

Assume now that G has as a subformula (x ∨ v ∨ p<sub>1</sub>) ∧ … ∧ (x ∨ v ∨ p<sub>n</sub>) ∧ (x ∨ ¬v ∨ q<sub>1</sub>) ∧ … ∧ (x ∨ ¬v ∨ q<sub>n</sub>), where p<sub>i</sub>, q<sub>i</sub> and v do not appear in Ats(σ′). Then the lookahead score of v at ⟨σ | G⟩ is 0 but in the state ⟨σ ¬x σ′ | G ∧ (c ∨ ¬x)<sup>L</sup>⟩ the score is n. Note that n is upper bounded by |Ats(ϕ)|, which in our scoring is also the highest heuristic value.
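The jump in the score of v can be reproduced with a small propositional Python sketch. It makes simplifying assumptions for illustration: G consists only of the subformula above, σ and σ′ are empty, the effect of the backjump is modelled by adding ¬x to the assignment, and the decided literal is not counted in the score. All names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Literal = Tuple[str, bool]  # (atom, polarity); ("x", False) stands for ¬x


def negate(lit: Literal) -> Literal:
    return (lit[0], not lit[1])


def up(sigma: Iterable[Literal], clauses: List[Set[Literal]]) -> Set[Literal]:
    """Propositional unit propagation closure; raises ValueError on conflict."""
    closure = set(sigma)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if clause & closure:
                continue  # clause already satisfied
            free = [lit for lit in clause if negate(lit) not in closure]
            if not free:
                raise ValueError("conflict during unit propagation")
            if len(free) == 1:
                closure.add(free[0])
                changed = True
    return closure


def example3_subformula(n: int) -> List[Set[Literal]]:
    """Build (x ∨ v ∨ p_i) and (x ∨ ¬v ∨ q_i) for i = 1..n."""
    clauses: List[Set[Literal]] = []
    for i in range(1, n + 1):
        clauses.append({("x", True), ("v", True), (f"p{i}", True)})
        clauses.append({("x", True), ("v", False), (f"q{i}", True)})
    return clauses


def score_of_v(sigma: Set[Literal], clauses: List[Set[Literal]]) -> int:
    """Atom score of v: minimum number of propagations over both polarities."""
    base = up(sigma, clauses)
    counts = []
    for polarity in (True, False):
        decided = ("v", polarity)
        counts.append(len(up(base | {decided}, clauses) - base - {decided}))
    return min(counts)
```

With n = 4, `score_of_v(set(), G)` evaluates to 0, whereas `score_of_v({("x", False)}, G)` evaluates to 4: once ¬x holds, every clause becomes binary and deciding either polarity of v propagates n literals.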

*4) Correctness and termination:* We finish the discussion with proofs of correctness and termination for Alg. 1.

*Theorem 1:* The algorithm either determines the satisfiability of the instance or constructs a balanced binary tree with each rooted path leading to the leaves corresponding to a unit-propagation consistent SMT instance.

*Proof.* The correctness of the Sat and Unsat results reported by the algorithm follows immediately from the observation that the result is obtained by modifying the solver state with the rules outlined in Sec. IV. Each rooted path of the tree corresponds to a unit propagation consistent instance. This follows from two observations. First, if setSolverToNode succeeds on a node n, the instance corresponding to the node is unit propagation consistent. Second, if expandTree succeeds, similarly by construction the instances corresponding to the nodes c and c′ are consistent. The resulting tree is balanced, since unless the execution terminates in lines 9, 15, or 16, the algorithm performs a DFS with a cutoff at depth d. □

*Theorem 2:* The algorithm terminates.

*Proof.* The procedure setSolverToNode terminates since it performs a sequence that is bounded by the depth of the node and consists of rules *Dec* and unit propagation closure computations that both terminate. The procedure expandTree terminates in a quadratic number of applications of *Dec*, *Undo* and unit propagation closure computations: the computation consists of lookahead steps each bounded by the number of atoms |Ats(ϕ)|. The local restart at a node n can be done at most |Ats(ϕ)| times, since each related backjump will assign at least one atom in the truth assignment of the solver state at node n.

The restarts in tree construction on lines 11 and 18 will not cause non-termination since the solver state is persistent (modulo possible applications of *Reset*) over such restarts. Following [18], the assignments of the solver together with the literals can be seen as a finite ordered sequence that is increased by every backjump and has a maximum element where every atom is assigned with no decision literals. □

Fig. 1. Runtime for lookahead partitioning to 16 for QF_LRA and QF_UF. Labeled boxes and crosses refer to specific instances discussed below. Unsatisfiable instances are denoted with boxes (⊡), and satisfiable with crosses (×).

#### VI. EXPERIMENTS

We report experiments on our implementation on the non-incremental benchmark divisions QF_UF and QF_LRA of SMT-LIB.<sup>4</sup> The two divisions are chosen since they constitute the foundation of most other SMT logics and allow us to directly observe the behaviour of the congruence closure (e-graph) and the Simplex algorithms under lookahead. All the experiments were run using the SMT solver OpenSMT [13]. The partitions are constructed with the implementation of Alg. 1, and, when applicable, solved with OpenSMT's default CDCL(T) engine running the VSIDS heuristic [19], a setup similar to most CDCL(T) solvers. The CPU time consumed by the experiments is slightly under 338 CPU days. We used a Linux cluster in which each node is equipped with two Intel Xeon E5-2650 v3 @ 2.30GHz CPUs, yielding 2 × 10 cores per node, and 64GB of DDR4@2133MHz memory. We ran at most ten solvers on each node simultaneously, limiting the memory available for a solver to 4GB. The timeout was 7200 s for both the partitioning and solving, except in Fig. 2 where the timeout was 1200 s. We first report on the efficiency of the partitioning implementation, and then show that the partitioning in general works well. Finally we study instances showing a slowdown anomaly. All times are given in seconds and refer to wall-clock times.

*1) Lookahead partitioning efficiency:* The plots in Fig. 1 illustrate the run times of Alg. 1 on the QF_LRA and QF_UF instances when partitioning into 16. The instances are ordered based on the run time. We only report the instances not solved during partitioning. The implementation is efficient in particular for QF_UF, where the run time stays within a few seconds in the majority of cases. The lookahead on QF_LRA is much more involved, perhaps due to the more expensive theory solving. Our implementation partitions 98% of the benchmarks within two hours, showing that the approach is realistic.

<sup>4</sup>The benchmarks are available at https://clc-gitlab.cs.uiowa.edu:2443/SMT-LIB-benchmarks under commit hash 33961bc4.

Fig. 2. Comparing sequential and Part(2) run times for QF_LRA (*top*) and QF_UF (*bottom*). On the top figure the boxes pointed to by the arrows are from Part(64) and show that the approach is efficient. The efficiency for QF_LRA is studied separately.

*2) Effect of partitioning on instance difficulty:* To measure how partitioning affects the instance difficulty, we study instances that OpenSMT can solve between 100 and 1000 seconds sequentially, a range where parallelization is useful but the baseline can still be computed within a reasonable time. This resulted in 13 instances for QF_UF and 144 instances for QF_LRA. The reported times do not include partitioning.

Figure 2 compares Part(2) to sequential solving for QF_LRA (*top*) and QF_UF (*bottom*). We plot the line y = x corresponding to no speed-up, and the dashed line y = 2x corresponding to a two-fold slowdown. The dashed horizontal and vertical lines in the top figure show the timeout of 1200 seconds. Crosses (×) and boxes (⊡) indicate satisfiable and unsatisfiable instances, respectively.

Except for three cases, Part(2) provides a consistent speedup in QF_UF. We ran these instances in Part(64) and each became easier to solve than the original instance (as shown by the downwards arrows that point to the corresponding Part(64) measurement). In conclusion, it seems that lookahead is efficient when combined with the congruence closure algorithm. This is somewhat expected since lookahead is efficient in purely propositional solving, and the congruence closure algorithm is scalable.

It is interesting to compare these results to QF_LRA, where lookahead is efficient in 60% of the instances, but we also observe significant slowdowns, corresponding to up to a 6-fold increase in run time. Repeating the experiment of partitioning with Part(64) did not result in a positive result similar to QF_UF (see Figs. 3 and 4), suggesting that this phenomenon has a different origin.

The partitioning run times for the anomalies are shown with the labels in Fig. 1. Typically their run times are above the average.

*3) Slowdown analysis for partitioning:* Despite Part resulting in most cases in a consistent speed-up, the significant slowdowns in QF_LRA warrant a separate study, as they pose a threat to lookahead partitioning in SMT. We label with (a)–(i) in Fig. 2 (*top*) nine instances where the run time more than doubles. We removed the randomness common in heuristic search by solving each partition several times with the OpenSMT VSIDS engine while changing the branching heuristic's random seed. We refer to this approach as the *simulated parallel solver*.

As a pre-processing phase we ran Part(k) for k = 2, 4, 8, …, 2048 for the instances (a)–(i) and stored the resulting partitions if the instance was not solved by Part. As a result of timeouts and one of the instances being solved during partitioning, we could run the full experiment set only for the instances (a), (d), and (f). We concentrate on these three instances since they seem representative of the others as well.

Figure 3 (*top*) shows run times for the simulated parallel solver on the only satisfiable instance (f). While the slowdown is consistent for Part(2), we observe speedup for Part(k), k ≥ 4. Figure 3 (*bottom*) shows the simulated parallel median run times on instance (d). The partitions are easy only once a large number of partitions, 1024, is reached. We show in addition run time ranges (green bars) and medians (blue stars) for the individual partitions. The instance (i) behaves similarly. Figure 4 shows the results for the instance (a), where the minimum, median, and maximum run times consistently increase. We show also the individual Part runs as yellow boxes. Instances (b), (c), (e), (g), and (h) behave similarly to (a). While the lookahead clearly identifies easier partitions, the hardest partitions seem to get more difficult. In particular Figs. 3 (*bottom*) and 4 show a significant number of partitions having a median time higher than the sequential median. The slowdown can be argued to result in part directly from these partitions.

The slowdown, which does not affect all instances uniformly, seems to be the result of an intricate interaction between lookahead and the incremental Simplex implementation typically used in SMT solvers [4]. The implementation maintains an internal model for its real-valued variables that satisfies all currently asserted inequalities. If a new inequality is not satisfied in the model, this triggers the pivoting sequence of Simplex that is in the worst case exponential. SMT solvers try to avoid this behavior by branching as much as possible on inequalities that are consistent with the model. Because of lookahead, Simplex is sometimes forced to follow such a sequence, causing the increasing run times for some of the partitions. It is a natural further question how to generalize lookahead to mitigate or avoid these cases.

Fig. 3. Scalability for a satisfiable instance (*top*) and partition difficulty for an unsatisfiable instance (*bottom*). The horizontal axis refers to the number of partitions produced, and the vertical axis to run time in seconds.

To conclude, we note that the lookahead partitioning produces in the vast majority of cases very balanced partitions and good speed-up. Nevertheless, the instance run times increase in a significant portion of the benchmarks. In the studied SMT-LIB benchmark divisions, we observed slowdown only for QF_LRA. We believe that it is possible to obtain speedup also for these instances by developing a version of the lookahead heuristic that also considers the configuration of the theory solvers run inside the SMT solver.

Fig. 4. Scalability and partition difficulty for an unsatisfiable instance. The horizontal axis refers to the number of partitions produced, and the vertical axis to run time in seconds.

#### VII. CONCLUSIONS

We present an algorithm for partitioning SMT with lookahead based on the CDCL(T) calculus and show experimentally that the approach is highly promising. We also demonstrate that the classical propositional lookahead is not in general sufficient in SMT, where the theory reasoning engines may unexpectedly interfere with the lookahead heuristic's view of the search space. In particular we found that in combination with Simplex as implemented in many SMT solvers, lookahead partitioning sometimes creates instances that are increasingly difficult to solve.

In the future we plan to extend the lookahead heuristic to better consider the theories. In parallel, we will also study lookahead partitioning in a more applied setting, including theory combinations and non-convex theories, where new atoms are introduced.

*Acknowledgements.* This research was supported by the Swiss National Science Foundation grant number 200021_185031.

#### REFERENCES


# A Multithreaded Vampire with Shared Persistent Grounding

Michael Rawson and Giles Reger University of Manchester

*Abstract*—Automated theorem provers (ATPs) typically run in a single thread. Hardware parallelism is then exploited through *portfolios*, in which distinct and disjoint strategies are launched as fully-independent processes and do not cooperate. Whilst there has been some historic exploration of cooperation, the technical challenge has prevented this from being fully explored in modern ATPs. The following describes the non-trivial engineering effort required to make the Vampire theorem prover multithreaded, such that multiple proof attempts coexist in the same memory space. This lays the foundations for a new generation of proof search techniques able to cooperate with other proof attempts running in parallel. As an initial demonstration, we implement a shared *persistent grounding* daemon that receives all clauses generated by all proof attempts and checks whether a heuristically-grounded version is unsatisfiable. The resulting multi-threaded system achieves limited contention compared to the previous process-based implementation, and persistent grounding improves performance in certain cases.

#### I. INTRODUCTION

Whilst parallel computational resources have become abundant and are used to good effect in many areas of computer science, they are yet to make a significant impact on automated theorem proving. We have seen substantial developments in SAT solving [1], [2], [3] and progress within SMT [4], [5], [6] but, to date, parallel automated theorem proving is typically historic with no modern implementation [7], [8], [9], or parallel at the level of portfolios without shared memory. The popularity of parallel portfolios is likely due to their ease of implementation and practical impact: it is common folklore that a good way to combat explosive proof search is a set of complementary search strategies. This success goes some way to explaining why research in other directions has been slow.

In this paper we discuss our initial work on a new shared-memory architecture for the VAMPIRE automated first-order theorem prover [10]. VAMPIRE is a saturation-based theorem prover that implements the superposition calculus [11] as its main mode, but also contains routines for instance-based reasoning [12] and finite model building [13]. It has won first place in the main track of the CASC competition for over 20 years [14] and implements advanced reasoning techniques for theory reasoning [15], [16], [17], inductive reasoning [18] and higher-order reasoning [19]. It consists of over 200k lines of C++ with contributions from over 15 developers and a permissive licence [20]. As such, it is a mature and highly complex piece of software.

Since 2010, VAMPIRE has supported some form of multiprocess parallelism where a portfolio of predetermined (and automatically generated) *strategies* (sets of proof search heuristics) could be implemented by forked processes. This achieves good results, but limits options for cooperation between proof attempts due to reliance on inter-process communication. In 2015, we proposed a concurrent architecture [21] that interleaved proof attempts within a single process whilst sharing (some) memory to explore a novel method for cooperation. Our conclusion at the time was that we needed true shared-memory parallelism to make progress.

We experienced two main difficulties with such an approach in VAMPIRE. The first is that it is difficult to implement correctly: this is a well-known feature of parallel programming, and we discuss our approach and experience below. The second is *contention*, which for our purposes is the negative performance impact caused by multiple threads using the same resource simultaneously, typically by having to wait for a lock held by another thread. Avoiding contention requires careful design of shared-memory schemes within an ATP.

A reasonable line of questioning raised in review asks whether it would be easier to start from scratch. It would probably be technically easier to do so: however, ATP systems at VAMPIRE's level of maturity take significant time to develop, even with the benefit of hindsight, so instead we offer pragmatic suggestions to convert existing systems.

The two main contributions of this paper are (1) A detailed discussion of the technical challenges and experience involved in transitioning a complex, mature theorem prover from a process-based model to a thread-based, shared-memory architecture (Section II), and (2) A new *persistent grounding* technique designed to take advantage of the shared memory concurrency provided by the architecture (Section III).

#### II. CHALLENGES AND EXPERIENCE

This section reflects on the engineering challenges we faced when converting VAMPIRE into a multi-threaded solver, and the approach we took to overcome them. We include this discussion to provide guidance for others attempting to complete a similarly-challenging task. Currently, the implementation is available in a branch of the VAMPIRE repository.<sup>1</sup>

#### *A. Design*

The architecture is based on the previous process-based architecture, which has not been described elsewhere. As illustrated in Fig. 1, the input problem is first parsed into a set of initial formulas over a signature (that is, the symbols appearing in the problem) shared between all proof attempts. A strategy scheduler uses a portfolio of strategies to generate a set of k threads. The parent scheduler supervises the child threads, reporting success if any child succeeds and spawning new threads to keep available CPU cores busy. Each thread preprocesses the problem, potentially extending the signature by e.g. introducing names for subformulas, and then performs proof search. This typically involves the use of complex data structures (*term indices*) for storing and searching for relevant clauses. VAMPIRE's complex custom memory allocator is disabled for this work, incurring a small performance hit.

<sup>1</sup>https://github.com/vprover/Vampire/tree/caps

Fig. 1. Schematic of Architecture.

Two complex parts of the architecture are currently protected by a coarse-grained lock. Only one proof attempt should print a proof, so this process is gated such that subsequent successful attempts block forever. A more difficult issue is *term sharing*. Part of the standard VAMPIRE is a hash-consing structure used to implement perfect term sharing, i.e. avoid duplication of terms. This is very convenient as it allows rapid identification of terms by pointer comparison, a property which is assumed throughout VAMPIRE. In our multithreaded architecture we share this structure and protect it by a lock. Term sharing must be able to distinguish between terms built solely from the shared signature and terms involving thread-specific symbols: that is, terms that could appear in any attempt versus terms that only have meaning in a single attempt.
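The idea of a hash-consing table guarded by a single coarse-grained lock can be sketched in Python. This is an illustration only and not VAMPIRE's C++ data structure: the classes `Term` and `TermBank` are hypothetical, and the sketch does not model the distinction between shared and thread-specific symbols.

```python
import threading
from typing import Dict, Tuple


class Term:
    """An immutable term: a functor applied to already-shared sub-terms."""
    __slots__ = ("functor", "args")

    def __init__(self, functor: str, args: Tuple["Term", ...]) -> None:
        self.functor = functor
        self.args = args


class TermBank:
    """Hash-consing table: structurally equal terms are the same object."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # one coarse-grained lock
        self._table: Dict[Tuple[str, Tuple[int, ...]], Term] = {}

    def make(self, functor: str, *args: Term) -> Term:
        # Sub-terms are already shared and kept alive by the table,
        # so their identities form a valid lookup key.
        key = (functor, tuple(id(arg) for arg in args))
        with self._lock:
            term = self._table.get(key)
            if term is None:
                term = Term(functor, args)
                self._table[key] = term
            return term
```

Because every thread obtains terms through `make`, two threads that build f(a, g(b)) independently receive the same object, and equality can be decided with the identity test `is`. The price is that every term construction takes the lock, which is the kind of contention the text describes.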

#### *B. Approach*

Converting a large, complex and performance-sensitive system such as VAMPIRE to work in thread-parallel is not especially easy. The approach outlined previously [21], in which proof attempts *interleave* in a single thread of execution rather than exist concurrently, at first seemed like a good intermediate step before starting work on a fully thread-parallel, shared-memory system. However, we found that bugs introduced by interleaved proof attempts were very difficult to track down, not least because very often they had no observable effect.

Instead we take a more chaotic approach, leaning heavily on tooling for developing multi-threaded applications, particularly tools for detecting *data races*. Data races, for our purposes, are execution scenarios in which two threads access shared memory without synchronisation, and at least one access is a write. Detection of races is extremely useful in our case as it provides a good proxy for identifying when one proof attempt influences the execution of another. Nearly all thread-related bugs (of which there were many) could then be squashed by examining the context in which races occur and introducing synchronisation or data reorganisation where appropriate.

Tools for detecting dubious constructs and execution states in low-level programming have improved significantly. We were particularly impressed by the LLVM-based [22] linter *clang-tidy* [23], which helped to identify and remove existing discouraged constructs in VAMPIRE's codebase, and the *ThreadSanitiser* [24] compiler instrumentation for the detection of data races. Armed with these tools, we simply introduced threads into VAMPIRE and waited for the tool reports. Races happened frequently in VAMPIRE at first, where code written under the implicit assumption of single-threaded execution breaks down, triggering a ThreadSanitiser report.

In general, data races tend to lead to crashes rather than unsound behaviour, but to avoid the latter we rely on (i) existing mechanisms for automated testing utilising large sets of labelled benchmarks [25], and (ii) VAMPIRE's support for proof checking, which allows us to independently verify the correctness of proof search [26].

#### *C. Thread-Local Storage, Atomics and Locking*

The most common source of the races was the re-use of heap-allocated temporaries such as stacks or maps, often used in iterative translations of recursive algorithms present throughout the system. Reusing these values once allocated can improve performance in the single-threaded case by avoiding repeated (de)allocations. The majority of such cases can be resolved by the use of thread-local storage as a compromise, incurring one allocation per thread. The 2011 C++ standard [27] provides a thread\_local keyword and associated machinery.
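A minimal sketch of this pattern follows, assuming a hypothetical `Node` tree and `sumTree` traversal that stand in for the iterative algorithms mentioned above. The re-used scratch stack is declared `thread_local` rather than `static`, so each thread allocates it once and concurrent calls no longer race on it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical tree type, used only for this illustration.
struct Node {
  int value;
  std::vector<const Node*> children;
};

// Iterative translation of a recursive sum over a tree.
int sumTree(const Node* root) {
  // One scratch stack per thread: allocated on first use in each thread,
  // then re-used (with its capacity retained) by later calls on that thread.
  thread_local std::vector<const Node*> todo;
  todo.clear();
  todo.push_back(root);
  int total = 0;
  while (!todo.empty()) {
    const Node* n = todo.back();
    todo.pop_back();
    total += n->value;
    for (const Node* c : n->children) todo.push_back(c);
  }
  return total;
}
```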

Another problem area is integer counters, often used for computing statistics and satisfying freshness constraints such as "select a fresh symbol for the Skolem function". Usually the only operation required is "read-and-increment", but this must sometimes be reflected across threads to maintain soundness of e.g. persistent grounding (Section III). This operation can be safely achieved atomically: C++'s <atomic> proved useful here.
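The "read-and-increment" operation can be sketched with `std::atomic` as below. The counter `nextSymbol` and the helpers `freshSymbol` and `drawConcurrently` are hypothetical names chosen for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical global counter handing out fresh symbol numbers.
std::atomic<unsigned> nextSymbol{0};

// Atomic read-and-increment: no two callers ever receive the same number.
unsigned freshSymbol() {
  return nextSymbol.fetch_add(1, std::memory_order_relaxed);
}

// Draw `perThread` fresh symbols on each of `threads` threads and return
// how many symbols were handed out in total by this call.
unsigned drawConcurrently(unsigned threads, unsigned perThread) {
  const unsigned before = nextSymbol.load();
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < threads; ++t) {
    pool.emplace_back([perThread] {
      for (unsigned i = 0; i < perThread; ++i) freshSymbol();
    });
  }
  for (auto& th : pool) th.join();
  return nextSymbol.load() - before;
}
```

With a plain `unsigned` counter, concurrent increments could be lost and two attempts could receive the same "fresh" symbol.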

Surprisingly, a full lock was only rarely required to synchronise compound operations. This relatively coarse technique was only required for widely-used modules with non-trivial internal invariants, such as the implementation of term sharing. Due to the small number of locks, deadlock was mostly avoided.

#### *D. Data Organisation and Partitioning*

Significant headaches can be avoided by carefully choosing which data are shared between proof attempts. A clever implementation could aggressively share all common data using very fine-grained synchronisation. For example, VAMPIRE maintains various term indices to quickly retrieve various syntactic data that satisfy some condition, like "retrieve all the literals that unify with L". In principle it would be possible to share at least some of these and save some memory, but in practice this is enormously difficult to implement correctly and efficiently. However, we remain interested in parallel term indices and may investigate these independently in future.

Currently, each proof attempt maintains its own clause space, computed properties and statistics, indices, introduced definitions, and ground reasoning systems such as those used in global subsumption [28] or AVATAR [29]. They do however share synchronised access to creating fresh symbols (although not all symbols are used in all proof attempts), term sharing, and persistent grounding (Section III). We feel this is a good initial trade-off.

#### *E. Timing and Internal Control*

One crucial difference between the multi-processing and multi-threading approaches to portfolio modes is that processes can be signalled to stop execution in a timely manner, whereas most threading abstractions do not have this ability. Threaded proof attempts must therefore frequently check for exit conditions, e.g. another proof attempt succeeded/time is up. Making these checks can be tricky: too frequently and there will be some performance impact; too infrequently and user experience or portfolio performance will begin to degrade. VAMPIRE executes a series of loops in its internal search routines: each iteration of these loops can take drastically different lengths of time depending upon the input problem.
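The trade-off in check frequency can be illustrated with a cooperative cancellation loop. This is a sketch under stated assumptions: `search`, `Outcome`, the `step` callback and the `interval` parameter are hypothetical and do not correspond to VAMPIRE's internal loops.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

enum class Outcome { Proved, Stopped, TimedOut };

// Hypothetical search loop: `step` performs one iteration and returns true
// when a proof is found. Exit conditions are checked only every `interval`
// iterations, trading responsiveness against per-iteration overhead.
template <typename Step>
Outcome search(Step step, const std::atomic<bool>& stop,
               Clock::time_point deadline, unsigned interval) {
  assert(interval > 0);
  for (unsigned long iter = 0;; ++iter) {
    if (iter % interval == 0) {
      // Another proof attempt has succeeded.
      if (stop.load(std::memory_order_relaxed)) return Outcome::Stopped;
      // Time is up.
      if (Clock::now() >= deadline) return Outcome::TimedOut;
    }
    if (step()) return Outcome::Proved;
  }
}
```

A fixed iteration count is a crude proxy: when iterations vary greatly in length, as noted above, the real delay between checks varies with them.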

#### *F. Synchronisation and Performance*

All the synchronisation measures introduced do incur some performance impact. Atomic operations are not quite *free*, but are very close in practice. Thread-local storage requires some checks for lazy initialisation, which can occur frequently if the compiler is unable to elide them, and is therefore not as cheap as we would like. VAMPIRE uses a global "environment" structure which was made thread-local: C++ semantics mean that this is considerably more efficient if an extra level of indirection is added such that the environment is accessed via a thread-local *pointer*. Locks are currently a major bottleneck: while contention was expected to be high, another problem is that the locked sections are typically relatively short and inexpensive compared to the locking overhead. We will investigate finer-grained locking and alternative locking strategies in future.
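The indirection through a thread-local pointer might look as follows. This is a sketch only: `Environment`, `env` and `runAttempt` are hypothetical stand-ins. A raw pointer initialised to `nullptr` is constant-initialised, so accesses do not need a lazy-initialisation guard, whereas a thread-local object with a non-trivial constructor may. The size of the saving depends on the compiler and platform.

```cpp
#include <cassert>
#include <string>
#include <thread>

// Hypothetical stand-in for a per-attempt environment with a
// non-trivial constructor.
struct Environment {
  std::string strategyName = "default";
  unsigned long clausesGenerated = 0;
};

// Constant-initialised thread-local pointer: one per thread.
thread_local Environment* env = nullptr;

// Run one proof attempt with its own environment, installed for the
// duration of the call. Returns the number of clauses the body generated.
template <typename Body>
unsigned long runAttempt(Body body) {
  Environment local;  // owned by the calling thread
  env = &local;
  body();
  env = nullptr;
  return local.clausesGenerated;
}
```

Code running inside `body` reaches its own attempt's state through `env->...`, in the same way single-threaded code would use a global.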

#### *G. Experimental Evaluation*

To validate the resulting system we carry out two experiments using the 500 first-order problems from the 2020 first-order theorem division of CASC. All experiments in this paper are run for 60 seconds per problem on an Ubuntu desktop machine with an 8-core CPU<sup>2</sup> and 16GB RAM.

TABLE I EVALUATING SCALABILITY OF THREADED ARCHITECTURE.

Firstly, we compare the new thread-based architecture with the previous process-based implementation. The thread-based architecture solves 413 problems (10 uniquely) and the process-based architecture solves 424 problems (21 uniquely). The slight degradation in performance is unsurprising given the additional contention in the thread-based approach. The symmetric difference reflects the sensitivity of VAMPIRE to variations in timing and memory usage. On average, the new thread-based architecture took 1.25x longer to solve problems. However, this is heavily influenced by short-running problems. Excluding problems solved in under 1s, the slowdown is 1.02x.
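The solved counts above are mutually consistent, as the following arithmetic shows: both architectures solve the same 403 problems, the symmetric difference contains 31 problems, and 434 problems are solved by at least one of the two.

$$413 - 10 = 424 - 21 = 403, \qquad 10 + 21 = 31, \qquad 403 + 31 = 434$$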

Secondly, we examine the scalability of the thread-based solution using the same set of problems whilst varying the number of threads. The results are in Table I. The number of problems solved peaks between 2 and 6 threads. We achieve approximately linear speedup with 2 and then 4 threads, but then plateau (based on the total time taken to solve the 352 problems solved by all attempts). The average solution time overall was the lowest for 6 threads; the lower average solution times for the intersection of solved problems suggest that these were the easier problems.

In summary, performance degrades slightly when replacing processes by threads (most likely due to contention) but the overhead is acceptable (∼2% on longer-running problems).

#### III. PERSISTENT GROUNDING

As a first step to explore the benefits of the new architecture, we introduce a lightweight form of clause sharing. All clauses produced by all proof attempts are grounded, shared, and passed to a SAT solver to detect a form of *global* inconsistency, i.e. an inconsistency in the ground abstraction of the full search space explored by all proof attempts, past and present.

The idea of grounding the search space of a first-order prover in an attempt to detect inconsistency is not novel [30], [31] and some methods, such as instance generation [12], perform grounding as part of proof search already. What is new in our approach is the *persistence* of the grounding: grounded clauses escape from and outlive their thread, allowing clauses from different proof attempts to interact.

#### *A. Extension to Architecture*

We introduce a queue (synchronised by a single lock) to which proof attempts add produced (and grounded) clauses, and a thread that loops, adding the grounded clauses to the MiniSAT solver [32] (yielding if the queue is empty) and checking for unsatisfiability. If the grounding is inconsistent the thread will report this immediately, interrupting other threads. Currently, full proof printing is not implemented and only the unsatisfiable core of grounded first-order clauses is identified. It is work-in-progress to rebuild the derivations that produced these clauses as a separate post-processing step.

<sup>2</sup>Intel® Core™ i7-6700 CPU @ 3.40GHz
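The queue and consumer thread described in this subsection can be sketched as below. All names are hypothetical. In particular, `ToySolver` is not MiniSAT and not a SAT solver: it only detects the empty clause and complementary unit clauses, and stands in for the real solver so that the sketch is self-contained.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <queue>
#include <set>
#include <thread>
#include <utility>
#include <vector>

using Clause = std::vector<int>;  // ground clause as signed SAT literals

// Queue of grounded clauses, synchronised by a single lock.
class ClauseQueue {
 public:
  void push(Clause c) {
    std::lock_guard<std::mutex> guard(lock_);
    pending_.push(std::move(c));
  }
  bool tryPop(Clause& out) {
    std::lock_guard<std::mutex> guard(lock_);
    if (pending_.empty()) return false;
    out = std::move(pending_.front());
    pending_.pop();
    return true;
  }

 private:
  std::mutex lock_;
  std::queue<Clause> pending_;
};

// Hypothetical stand-in for a SAT solver (assumption: NOT complete).
struct ToySolver {
  std::set<int> units;
  bool inconsistent = false;
  void add(const Clause& c) {
    if (c.empty()) inconsistent = true;
    if (c.size() == 1) {
      if (units.count(-c[0])) inconsistent = true;
      units.insert(c[0]);
    }
  }
};

// Consumer loop: drain the queue, yield when it is empty, and raise `stop`
// as soon as the clauses seen so far are found to be inconsistent.
void groundingLoop(ClauseQueue& queue, std::atomic<bool>& stop,
                   std::atomic<bool>& refuted) {
  ToySolver solver;
  Clause c;
  while (!stop.load()) {
    if (!queue.tryPop(c)) {
      std::this_thread::yield();
      continue;
    }
    solver.add(c);
    if (solver.inconsistent) {
      refuted.store(true);
      stop.store(true);  // interrupts the other proof attempts
    }
  }
}
```

Proof attempts call `push` from their own threads; because the clauses outlive the attempt that produced them, clauses from different attempts can combine into a refutation.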

We maintain a mapping from (grounded) first-order literals to SAT literals such that a fresh first-order literal leads to a fresh SAT literal, with the mapping stored for later. This mapping relies on the shared term indexing structure to efficiently identify atoms that are shared between proof attempts, ensuring they are represented using the same SAT variables.
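A minimal sketch of such a mapping, assuming that shared atoms are identified by pointer: `Atom`, `GroundLiteral` and `SatNaming` are hypothetical names, and SAT literals are represented as signed integers for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <unordered_map>

// Hypothetical: atoms are perfectly shared, so the address identifies the atom.
struct Atom {};

struct GroundLiteral {
  const Atom* atom;
  bool positive;
};

// Maps shared atoms to SAT variables 1, 2, 3, ... and returns signed literals.
class SatNaming {
 public:
  int toSat(GroundLiteral lit) {
    std::lock_guard<std::mutex> guard(lock_);
    // emplace inserts a fresh variable only if the atom has not been seen;
    // otherwise it returns the stored entry unchanged.
    auto entry = vars_.emplace(lit.atom, static_cast<int>(vars_.size()) + 1);
    int var = entry.first->second;
    return lit.positive ? var : -var;
  }

 private:
  std::mutex lock_;
  std::unordered_map<const Atom*, int> vars_;
};
```

Two proof attempts that ground to the same shared atom obtain the same SAT variable, which is what allows their clauses to interact in the solver.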

#### *B. Grounding Choices*

There are numerous ways in which we could choose to ground first-order clauses. We implement three alternatives:


Where the input problem is multi-sorted the above constants are selected per-sort. We compute constant frequency on the problem before preprocessing i.e. before subformulas are copied or reduced.

#### *C. Experimental Analysis*

We use the same 500 problems and experimental setup as above to analyse the impact of this new addition. Our first experiment is to isolate the impact of persistent grounding from threading by running with a single thread. In this setting, we solve 399 problems without persistent grounding and 398 with (using the fresh grounding) but with a symmetric difference of 11 problems: persistent grounding allows us to solve 5 problems we did not solve without it. Some problems were also solved significantly faster: for 8 problems the speedup was > 2×, with one problem (SWB105+1) solved 15× faster (from 25s to 1.6s).

Next, we compare the different grounding mechanisms (using 6 threads). The results are given in Table II (top 4 rows). The first observation is that we solve 8 problems that we did not solve without persistent grounding, and each grounding mechanism solves some problems uniquely.

However, the average time to solve each problem increases. The fresh grounding mechanism fares the worst, while the common grounding mechanism produced a proof more than a second before the other mechanisms on 5 problems. Within this there are some notable cases. For example, GRP667+1 was solved using input in 15s, whilst the others failed to solve it using persistent grounding and it was eventually solved in the normal way after 50s. Similarly, ITP006+4 was solved using common in 9s rather than the 25s elsewhere.

TABLE II PERSISTENT GROUNDING EVALUATION.


We explore two further variants (rows 5–7 of Table II): in *active-only* we restrict persistent grounding only to so-called *active* clauses [10] and in *no-splitting* we turn clause splitting off for all strategies. Clause splitting introduces additional (per proof attempt) propositional literals into split clauses, potentially reducing the amount of sharing between proof attempts. Active-only solves more problems and (not shown in the table) enjoys a slight reduction in solving times in cases where persistent grounding is not used to solve the problem. Turning clause splitting off solves fewer problems but is nicely complementary (solving 5 problems uniquely).

In summary, the persistent grounding method can drastically speed up proof search when it finds a proof, but it generally adds a noticeable overhead. Overall, we solve 12 problems with variants of persistent grounding that we were unable to solve without it. The main observation is that it is possible to prove more by sharing information between proof attempts than by simply running the union of proof attempts separately, but more work is required to make this approach efficient.

#### IV. REFLECTION AND FUTURE WORK

We describe our initial efforts transforming VAMPIRE to a multi-threaded architecture and show how this new shared-memory architecture can easily support methods for clause sharing. Whilst the concepts involved are straightforward, the engineering effort required to transform a mature codebase from a process-based single memory architecture to a thread-based shared-memory one is large. We have described our experience for others. Our general findings are:


The new shared persistent grounding method gave lacklustre results but only represents a first step in a number of opportunities presented by the new architecture. Directions we plan to pursue in the future include:


#### ACKNOWLEDGEMENT

This work was funded by EPSRC project EP/V000209/1: *CAPS: Collaborative Architectures for Proof Search*.

#### REFERENCES

